Abstract

We propose a new method for mining frequent patterns in a language that combines both 
Semantic Web ontologies and rules. In particular we consider the setting of using a lan- 
guage that combines description logics with DL-safe rules. This setting is important for 
the practical application of data mining to the Semantic Web. We focus on the relation 
of the semantics of the representation formalism to the task of frequent pattern discovery, 
and for the core of our method, we propose an algorithm that exploits the semantics of the 
combined knowledge base. We have developed a proof-of-concept data mining
implementation of this method. Using it, we have shown empirically that using the
combined knowledge base to perform semantic tests can make data mining faster by
pruning useless candidate patterns before their evaluation. We have also shown that
the quality of the set of patterns produced may be improved: the patterns are more
compact, and there are fewer patterns.
We conclude that exploiting the semantics of a chosen representation formalism is key to 
the design and application of (onto-)relational frequent pattern discovery methods. 

Note: To appear in Theory and Practice of Logic Programming (TPLP). 

KEYWORDS: frequent pattern discovery, ontologies, Semantic Web, DL-safe rules



1 Introduction 



The discovery of frequent patterns is a fundamental data mining task. It has been
studied for many different forms of input data and patterns. Within the relational
setting it has been investigated since the development of WARMR (Dehaspe and
Toivonen 1999). WARMR uses the Datalog subset of first-order logic (FOL) as the
representation language for both data and patterns. As such, WARMR and other
subsequently proposed relational frequent pattern miners, FARMER (Nijssen and
Kok 2001; Nijssen and Kok 2003) and c-armr (de Raedt and Ramon 2004), can
be classified as Inductive Logic Programming (ILP) (Nienhuys-Cheng and de Wolf
1997; Dzeroski and Lavrac 2001) methods. These ILP systems have been successfully
applied to a number of domains, most notably bioinformatics (King et al. 2000a;
King et al. 2000b; King et al. 2001) and chemoinformatics (Dehaspe et al. 1998).

J. Jozefowska and A. Lawrynowicz and T. Lukaszewski



While relational frequent pattern mining methods have mostly assumed Datalog
as the representation language, currently most activity within the field of knowledge
representation (KR) assumes the use of logic-based ontology languages such as
description logics (DLs) (Baader et al. 2003). Thanks to its significant support for
modelling ontologies, and its suitability to the inherently open and incomplete nature
of the Web environment, description logic has been chosen as the formal foundation
of the standard ontology language for the Web, the Web Ontology Language
(OWL) (McGuinness and van Harmelen 2004). OWL is now considered one of the
fundamental technologies underpinning the Semantic Web (Berners-Lee et al. 2001),
currently one of the most active application fields of artificial intelligence.

Research in KR is focused on developing deductive reasoning procedures, which
are also traditionally employed to reason with logic-based ontological data. However,
to meet the challenges posed by the scale and use cases of the Semantic Web, such
deductive approaches are not enough. Therefore, there is a recent trend in Semantic
Web research to propose complementary forms of reasoning that are more efficient
and noise-tolerant. A promising approach in this area is to use inductive methods to
complement deductive ones. This is in line with the recent trend in ILP research to
broaden the scope of the logical formalisms considered to description logics, or to
hybrid languages combining description logics with logic programs. Since description
logic knowledge bases are often equated with ontologies, ILP methods applied to
such knowledge bases have been referred to as "ontology mining" methods (Fanizzi
and d'Amato 2006; Fanizzi et al. 2008; d'Amato et al. 2008), and the ones applied
to hybrid knowledge bases as "onto-relational mining" methods (Lisi and
Esposito 2008). To the best of our knowledge, only one onto-relational frequent
pattern mining method, SPADA (Lisi and Malerba 2004), has been proposed.



This paper describes a method for frequent pattern mining in knowledge bases
represented in the formalism of DL-safe rules (Motik et al. 2005), which combine
Semantic Web ontologies (represented in description logic) and rules (represented
in disjunctive Datalog). This language meets the requirements of knowledge
representation for the Semantic Web and the target application domains, and possesses
properties suitable for data mining applications. At the core of the method, we
propose an algorithm that exploits, at various steps, the semantics of the combined
knowledge base on which it operates. We show how to realize the proposed method
using state-of-the-art reasoning techniques, and present a proof-of-concept
implementation of the method. For Semantic Web research, the paper
contributes to the general understanding of the role of ontologies and semantics in
helping to solve knowledge-intensive tasks by exploiting the meaning of the
represented knowledge. For ILP data mining research, the method's main novel feature
is its exploitation of the semantics of the chosen language.

The rest of the paper is organized as follows. Section 2 discusses the technical and
application-oriented motivation of the work. In Section 3 we introduce the basics of
the knowledge representation formalisms considered in this paper, and the problem of
frequent pattern mining from combined knowledge bases. In Section 4 we present
our method for mining frequent patterns. In Section 5 we present the experimental
evaluation of the proposed approach. Section 6 contains the discussion of
related work. Finally, Section 7 concludes the paper and outlines future work.



2 Motivation 



2.1 The Setting 

The problem of combining ontologies with rules is central in the Semantic Web. In 
the current stack of the Semantic Web languages, rules are placed in the same layer 
as ontologies. There is an ongoing initiative to define an open format for rule
interchange on the Semantic Web, the Rule Interchange Format (RIF)^1, which will cover
a wide spectrum of rule types, among them deductive rules represented in Datalog.
As we will discuss further in the paper, some important application domains such 
as life sciences require a language that combines description logic with some form 
of Datalog rules. 

Since a straightforward combination of DL and rules may easily lead to the un- 
decidability of reasoning problems, the problem of developing such combinations 
has received a lot of attention in KR and Semantic Web research. This has resulted 
in several proposals which may be generally divided into the following approaches: 
interaction of rules and ontologies with strict semantic separation (loose coupling), 
interaction of rules and ontologies with strict semantic integration (tight coupling), 
and reductions from DLs to logic programming formalisms. 



In the first approach, adopted by dl-programs (Eiter et al. 2004a; Eiter et al.
2004b; Eiter et al. 2008), DL and rule components are technically separate, and
can be seen as black boxes communicating via a "safe interface".

In the second type of approach, "safe interaction", rules and DL knowledge bases
are combined in a common semantic framework. A straightforward, tight extension
of DL with first-order implication, as proposed for the Semantic Web Rule Language
(SWRL) in (Horrocks et al. 2004), is trivially undecidable. On the other hand,
Description Logic Programs (DLP) (Grosof et al. 2003) describe a decidable
intersection of description logic and logic programs. Between these two opposite
approaches there is a group of proposals such as AL-log (Donini et al. 1998),
CARIN (Levy and Rousset 1998), DL-safe rules (Motik et al. 2005) and DL+log
(Rosati 2006), where, to obtain decidability, either the DL, or the rules, or both are
typically constrained by various syntactic restrictions, e.g. in the form of a safety
condition. However, the usual syntactic restrictions may also be dropped by changing
the usual perspective of the integration from that of DLs to that of rule-based
systems, as proposed in (Lukasiewicz 2007) for the case of a tightly integrated form
of disjunctive dl-programs. Finally, the tight integration may also become a full
one, as in hybrid MKNF knowledge bases (Motik and Rosati 2007), where there is
no separation between the vocabularies of the DL and rule components.



^1 http://www.w3.org/2005/rules/wiki/RIF_Working_Group

An interesting representative of the works that reduce description logics to
logic programming is proposed in (Hustadt et al. 2004; Hustadt et al. 2007) for the
expressive DL language $\mathcal{SHIQ}$. In that approach, consistency checking and
query answering are reduced to the evaluation of a positive disjunctive
Datalog program, which is obtained by a translation of a description logic
knowledge base to first-order logic, followed by an application of superposition
techniques and the elimination of function symbols from the resulting set of clauses.

Apart from the desired expressivity, there are also other important requirements
for a language to be used in frequent pattern mining applications, which are
data-intensive in nature. In the last decade, the focus of KR research has been mostly
on developing reasoning techniques for handling complex DL intensional knowledge,
and on decidability issues. However, new Semantic Web applications require
efficient, scalable procedures for query answering over ontologies, which is now
becoming an intensively explored area of research. Scalability may be achieved by
restricting the features of a DL language to obtain a lightweight one tailored for
data-intensive applications, as in the case of the tractable family of languages called
DL-Lite (Calvanese et al. 2007). An interesting recent study in this direction is
presented in (Calì et al. 2009), where a family of expressive extensions of Datalog is
proposed that generalize the DL-Lite family, e.g. by admitting existentially quantified
variables in rule heads. The requirement for efficient query answering over large
amounts of data (extensional knowledge) is crucial for frequent pattern mining
applications.

Taking into account both criteria, that is, expressivity sufficient for real
applications and efficient query answering procedures, one combination of DLs and
rules with interesting properties is the formalism of DL-safe rules
(Motik et al. 2005). In this formalism decidability is obtained by restricting the rules
to DL-safe ones, which are applicable only to instances explicitly known by name. As
shown in (Hustadt et al. 2004; Hustadt et al. 2007), the restriction to DL-safety
enables the transformation of a DL knowledge base to a disjunctive Datalog
program. This in turn enables the application of well-known reasoning algorithms
and optimization techniques (such as magic sets or join-order optimizations)
developed for deductive databases in order to handle large data quantities. Some of
these methods have recently been extended to disjunctive Datalog (Cumbo et al.
2004). The algorithm proposed in (Motik et al. 2005) for query answering in DL
with DL-safe rules separates reasoning on the intensional part of a knowledge base
from that on the extensional part, which means that the inferences made on the
intensional part are not repeated for different instances during query answering. This
in turn enables better complexity results for the query answering algorithm than
in the case of the other state-of-the-art reasoning techniques developed for expressive
DLs (Hustadt et al. 2005; Motik and Sattler 2006). It should be noted that, if the
translation does not generate any disjunctive rules, then the algorithm applies the
least fixpoint operator used to evaluate non-disjunctive Datalog programs. Since the
consequences of the least fixpoint operator can be computed in polynomial time,
an important feature of the algorithm is that its behaviour becomes tractable when
it is applied to less expressive languages.
The discussed features make the chosen DL-safe rules formalism suitable for the
envisaged frequent pattern mining applications.
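
The least fixpoint evaluation of a non-disjunctive Datalog program mentioned above can be illustrated with a minimal sketch. It is a naive bottom-up evaluator written for illustration only, not the algorithm of (Motik et al. 2005); the atom encoding (a predicate name with a tuple of terms), the "?" prefix for variables, and the edge/path program are all assumptions of this example.

```python
def match(atom, fact, binding):
    """Extend `binding` so that `atom` matches the ground `fact`, or return None."""
    pred, args = atom
    fpred, fargs = fact
    if pred != fpred or len(args) != len(fargs):
        return None
    extended = dict(binding)
    for term, value in zip(args, fargs):
        if term.startswith("?"):  # variable (assumed naming convention)
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:       # constant must match exactly
            return None
    return extended


def least_fixpoint(rules, facts):
    """Apply all rules to the current fact set until no new fact is derived."""
    model = set(facts)
    while True:
        derived = set()
        for (head_pred, head_args), body in rules:
            bindings = [{}]
            for atom in body:  # join the body atoms left to right
                next_bindings = []
                for binding in bindings:
                    for fact in model:
                        extended = match(atom, fact, binding)
                        if extended is not None:
                            next_bindings.append(extended)
                bindings = next_bindings
            for binding in bindings:
                derived.add((head_pred, tuple(binding.get(t, t) for t in head_args)))
        if derived <= model:  # fixpoint reached
            return model
        model |= derived


# Hypothetical program: path is the transitive closure of edge.
facts = {("edge", ("a", "b")), ("edge", ("b", "c"))}
rules = [
    (("path", ("?x", "?y")), [("edge", ("?x", "?y"))]),
    (("path", ("?x", "?z")), [("edge", ("?x", "?y")), ("path", ("?y", "?z"))]),
]
model = least_fixpoint(rules, facts)
```

Each round only adds facts over the fixed set of constants, so the number of rounds is bounded by the number of possible ground atoms. This is the intuition behind the polynomial bound for a fixed program.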



2.2 Possible applications 

The primary motivation for our work is the application of our method to real-world
data mining problems. Arguably the most extensive use of Semantic Web
KR methods is in the domain of biology. Large amounts of data are increasingly
becoming openly available and described using real-life ontologies represented in
Semantic Web languages, such as GO (Gene Ontology)^2 or BioPax (biological
pathway knowledge)^3. This opens up the possibility of interesting large-scale,
real-world onto-relational data mining applications.

Below we will describe why KR in biology requires a language able to model the 
existence of unknown entities, disjunctions, and arbitrary composition of relations, 
that is a language that combines description logic with some form of Datalog rules. 
These requirements are a domain specific motivation for our language selection. 

Information stored in biological knowledge bases is inherently incomplete. For
example, in functional genomics "every protein has a function", but often this
function is unknown. Similarly, it is known that certain genes exist because they
encode known proteins, but the identity of these genes is unknown (so-called
"locally orphan" genes). The existence of entities with unknown identity can be easily
represented in description logic^4, while it cannot be represented in Datalog.

Another way of modelling incompleteness is by the use of disjunction, which is not
expressible in Datalog. For example, disjunction may be used to describe that one
instance of certain tertiary structure units must be present in a protein ("a classical
tyrosine phosphatase has at least one low molecular weight phosphotyrosine or one
tyrosine specific with dual specificity p-domain" (Stevens et al. 2007)).

In general, DLs employ the open-world assumption (OWA), which seems suitable
for a domain characterized by information that is incomplete either due to the
limits in the current state of knowledge or due to omissions common in curation
processes. OWA is closely related to the monotonic form of reasoning classically
assumed in FOL. The monotonicity of reasoning in DLs is in line with the need
for the knowledge held in scientific knowledge bases to comprise only information
that is generally accepted and experimentally validated, and which, it is reasonable
to assume, will not be falsified. For example, one may state that "In E. coli K-12,
the protein encoded by the gene ECK0647 when in inner membrane, facilitates the
transport of glutamate from the periplasm to the cytoplasm"^5 (Ruttenberg et al.
2005). This statement provides neither a reference to a particular enzyme nor
any information on whether the transport is active or passive. If we subsequently learn
such information, this does not change any positive or negative conclusions. Since
it is characteristic of the current state of biology that much information is only
known to a certain degree, the monotonicity of reasoning in DLs allows scientific
knowledge bases to be extensible and evolve with scientific knowledge.

^2 http://www.geneontology.org
^3 http://www.biopax.org
^4 $Protein \sqsubseteq \exists hasFunction$
^5 $inMembrane \sqcap inCytoplasm \sqsubseteq \bot$, $ECK0647\_Protein \sqsubseteq inCytoplasm$,
   $glutamateTransport \sqsubseteq \exists participant.ECK0647\_Protein \sqcap inMembrane$

However, while possessing features which are not available in Datalog, description
logic also has limitations. It does not allow relations of arbitrary arity or
arbitrary composition of relations. Assume, for example, that we would like to express
that "whenever a metal ion is bound to a phosphatase which catalyses a
dephosphorylation of some protein, then this ion regulates the dephosphorylation of this
protein" (Stevens et al. 2007)^6. This requires modelling that a composition of relations
implies another relation, which may be expressed in the form of a rule.

The above arguments show the need for languages combining description logic
with some form of rules in the discussed field, and the DL-safe rules formalism
meets all the necessary requirements for expressivity. It is also interesting to note
that (Ruttenberg et al. 2005), while discussing problems concerned with
using description logic for modelling metabolic pathways, has already argued
that there should be a way of modelling that certain axioms may be applied only
if there are known instances of some class. A possible way to check this is to
submit a query with a DL-safety condition.

An interesting sample application of frequent pattern mining in the field of bi- 
ology may be in functional genomics with the goal of the identification of frequent 
patterns in the amino acid sequence descriptions. The results would be further used 
to generate rules for predicting protein functional class. Such an approach would 
constitute an onto-relational upgrade of the relational data mining application al- 
ready proposed in the literature (King et al. 2000b). Another, novel application 
may be in metabolic pathways analysis. The goal of the application would be to 
identify common frequent pathways in human and other organisms that cause hu- 
man diseases. The results of such analysis would allow for targeted drug design. 

Besides life sciences, frequent pattern mining applications on Semantic
Web data may be valuable in many other domains. Let us take e-business as another
example. As rules provide a powerful business logic representation, many use
cases^7 provided for RIF are actually in this domain. The main value that
combinations of description logic ontologies and rules may provide for e-business
lies in increased interoperability. Description logic provides the means for expressing
common vocabularies and domain knowledge, while rules make it possible to express
business policies explicitly. For example, annotation of product and service offerings
with terms from common ontologies such as GoodRelations^8 may enable customers
and enterprises to search automatically for suitable suppliers across the Web. Further,
employing rules may make it possible to automatically express business relations
between offerings and customers, and to express business policies such as "The
discount for a customer who buys a product is 5 percent if the customer is premium
and the product is regular"^9.

^6 $regulates(x, y) \leftarrow metalIon(x), isBound(x, z), phosphatase(z), catalyses(z, y), dephosphorylation(y)$
^7 http://www.w3.org/TR/rif-ucr
^8 http://www.heppnetz.de/projects/goodrelations
^9 $discount(x, y, percent5) \leftarrow premium(x), regular(y), buys(x, y)$

In the e-business domain, an interesting sample application of frequent pattern
mining may be in finding frequent customer buying behaviors to support
personalisation, recommendation services and targeted marketing.

It should be stressed that for e-business applications, relatively lightweight on- 
tologies may be sufficient, but the need for combining them with rules is essential 
in this domain. 



3 Preliminaries 

3.1 Language of knowledge representation 

In this section we introduce the language of knowledge representation based on 
the formalism of DL-safe rules. Further in this paper we develop an algorithm for 
frequent pattern discovery in this language. DL-safe rules combine description logics 
with disjunctive Datalog, which we briefly recall below. 



3.1.1 Description logics 



Description logics (DLs) (Baader et al. 2003) are a family of knowledge
representation languages, specifically suited to represent terminological knowledge in a
structured and formalized way. Two kinds of atomic symbols are distinguished in 
any description logic language: atomic concepts (denoted by A) and atomic roles 
(denoted by R and S). Atomic symbols are elementary descriptions from which we 
inductively build complex descriptions (denoted by C and D) using concept con- 
structors and role constructors. Description logics differ by the set of constructors 
they admit. 

DLs are equipped with a logic-based model-theoretic semantics. The semantics is
defined by interpretations $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$, where the non-empty set
$\Delta^{\mathcal{I}}$ is the domain of the interpretation and the interpretation function
$\cdot^{\mathcal{I}}$ assigns a set $A^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}$ to every atomic concept $A$ and a binary
relation $R^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$ to every atomic role $R$. The interpretation
function is extended to concept descriptions by an inductive definition. The syntax
and semantics of the $\mathcal{SHIF}$ DL are defined in Table 1.

A description logic knowledge base, KB, is typically divided into an intensional
part (terminological one, TBox) and an extensional part (assertional one, ABox).
The TBox contains axioms dealing with how concepts and roles are related to each
other (terminological axioms), while the ABox contains assertions about individuals
(assertional axioms). A semantics is given to ABoxes by extending interpretations
$\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$ by an additional mapping of each individual name $a$ to an
element $a^{\mathcal{I}} \in \Delta^{\mathcal{I}}$. The interpretation $\mathcal{I}$ satisfies a set of axioms (a TBox
$\mathcal{T}$ and/or an ABox $\mathcal{A}$) iff it satisfies each element of this set.



3.1.2 Disjunctive Datalog 



Disjunctive Datalog (Eiter et al. 1997) is an extension of Datalog that allows dis- 
junctions of literals in the rule heads. 
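Table 1: Syntax and semantics of $\mathcal{SHIF}$.

Constructor                       Syntax            Semantics

Concept constructors

Universal concept                 $\top$            $\Delta^{\mathcal{I}}$
Bottom concept                    $\bot$            $\emptyset$
Negation of arbitrary concepts    $(\neg C)$        $\Delta^{\mathcal{I}} \setminus C^{\mathcal{I}}$
Intersection                      $(C \sqcap D)$    $C^{\mathcal{I}} \cap D^{\mathcal{I}}$
Union                             $(C \sqcup D)$    $C^{\mathcal{I}} \cup D^{\mathcal{I}}$
Value restriction                 $(\forall R.C)$   $\{a \in \Delta^{\mathcal{I}} \mid \forall b : (a, b) \in R^{\mathcal{I}} \rightarrow b \in C^{\mathcal{I}}\}$
Full existential quantification   $(\exists R.C)$   $\{a \in \Delta^{\mathcal{I}} \mid \exists b : (a, b) \in R^{\mathcal{I}} \wedge b \in C^{\mathcal{I}}\}$
Functionality                     $\leq 1\, R$      $\{a \in \Delta^{\mathcal{I}} \mid |\{b \mid (a, b) \in R^{\mathcal{I}}\}| \leq 1\}$

Role constructors

Inverse role                      $R^{-}$           $\{(a, b) \in \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}} \mid (b, a) \in R^{\mathcal{I}}\}$
Transitive role                   $Trans(R)$        $R^{\mathcal{I}}$ is transitive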



Definition 1 (Disjunctive Datalog rule)
A disjunctive Datalog rule is a clause of the form

$H_1 \vee \ldots \vee H_k \leftarrow B_1, \ldots, B_n$

where $H_i$ and $B_j$ are atoms, and $k \geq 1$, $n \geq 0$. □

Definition 2 (Disjunctive logic program)

A disjunctive logic program P is a finite collection of disjunctive Datalog rules. □

We consider only disjunctive Datalog programs without negative literals in the
body, that is, positive programs.

For the semantics, only Herbrand models are considered, and the semantics of
P is defined by the set of all minimal models M of P, denoted by $\mathcal{MM}(P)$. A
ground literal L is called a cautious answer of P, written $P \models_c L$, if $L \in M$ for
all $M \in \mathcal{MM}(P)$. FOL entailment coincides with cautious entailment for positive
ground atoms on positive programs.
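
To make the notions of minimal model and cautious answer concrete, the following sketch enumerates the models of a small ground, positive disjunctive program by brute force. It is meant only as an illustration of the definitions, not as a practical evaluation procedure; the encoding of ground atoms as strings and the two-rule program (a fact and one disjunctive rule about a hypothetical client Jan) are assumptions of this example.

```python
from itertools import combinations


def is_model(interpretation, rules):
    """A rule is satisfied if its body is not fully true or some head atom is true."""
    for head, body in rules:
        body_true = set(body) <= interpretation
        head_true = bool(set(head) & interpretation)
        if body_true and not head_true:
            return False
    return True


def minimal_models(rules, atoms):
    """Enumerate all subsets of `atoms`, keep the models, then keep the minimal ones."""
    models = [
        set(subset)
        for size in range(len(atoms) + 1)
        for subset in combinations(atoms, size)
        if is_model(set(subset), rules)
    ]
    return [m for m in models if not any(other < m for other in models)]


def cautious_answer(rules, atoms, atom):
    """`atom` is a cautious answer iff it holds in every minimal model."""
    return all(atom in m for m in minimal_models(rules, atoms))


# Hypothetical ground program:
#   Client(Jan) <-
#   p_man(Jan) v p_woman(Jan) <- Client(Jan)
atoms = ["Client(Jan)", "p_man(Jan)", "p_woman(Jan)"]
rules = [
    (("Client(Jan)",), ()),
    (("p_man(Jan)", "p_woman(Jan)"), ("Client(Jan)",)),
]
mm = minimal_models(rules, atoms)
```

Here the program has two minimal models, one containing p_man(Jan) and one containing p_woman(Jan), so Client(Jan) is a cautious answer while neither disjunct is.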

3.1.3 DL-safe rules 



We use the formalism of DL-safe rules introduced in (Motik et al. 2005). The
description logic $\mathcal{SHIF}$ and disjunctive Datalog rules are integrated by allowing
concepts and roles to occur in rules as unary and binary predicates, respectively.
Below we define DL-safe rules with respect to the description logic $\mathcal{SHIF}$ and
disjunctive Datalog rules.
Definition 3 (DL-safe rules)

Let KB be a knowledge base represented in the $\mathcal{SHIF}$ language. A DL-predicate
is an atomic concept or a simple role from KB. For $t_1$ and $t_2$ being constants or
variables, a DL-atom is an atom of the form $A(t_1)$, where $A$ is an atomic concept in
KB, or of the form $R(t_1, t_2)$, where $R$ is a simple role in KB, or of the form $t_1 = t_2$.
A non-DL-predicate is any predicate other than $=$, an atomic concept in KB, or a
role in KB. A non-DL-atom is an atom whose predicate is neither an atomic
concept in KB nor a role in KB. A (disjunctive) DL-rule is a (disjunctive) rule with
DL- and non-DL-atoms in the head and in the body. A (disjunctive) DL-program
P is a finite set of (disjunctive) DL-rules. A combined knowledge base is a pair
(KB, P). A (disjunctive) DL-rule r is called DL-safe if each variable in r occurs in
a non-DL-atom in the rule body. A (disjunctive) DL-program P is DL-safe if all its
rules are DL-safe. □

In order to define the semantics of a combined knowledge base (KB, P), the KB
axioms are mapped into a (disjunctive) Datalog program DD(KB), which entails
exactly the same set of ground facts as KB. The details concerning the mapping can
be found in (Motik 2006; Motik and Sattler 2006). It is proved that KB is satisfiable
with respect to the standard model-theoretic semantics of $\mathcal{SHIF}$ iff DD(KB) is
satisfiable in first-order logic (Motik 2006). It is also proved in (Motik 2006) that for
a combined knowledge base (KB, P) consisting of a $\mathcal{SHIF}$ knowledge base KB and
a finite set of DL-safe rules P, $(KB, P) \models \alpha$ iff $DD(KB) \cup P \models \alpha$, for a ground
atom $\alpha$, where $\alpha$ is of the form $A(a)$ or $R(a, b)$, and $A$ is an atomic concept.
Therefore, reasoning in (KB, P) can be performed using the well-known techniques
from the field of deductive databases.

DL-safety implies that each variable is bound only to constants explicitly
introduced in (KB, P). Let us consider, for example, a combined knowledge base
(KB, P) such that KB contains the concept Person and the roles livesAt and worksAt,
while P contains the following rule defining Homeworker as a person who lives and
works at the same place:

$Homeworker(x) \leftarrow Person(x), livesAt(x, y), worksAt(x, y)$ (1)

This rule is not DL-safe, because the variables x and y that occur in the
DL-atoms Person(x), livesAt(x, y), worksAt(x, y) do not occur in the body in any
non-DL-atom. Let us introduce a special non-DL-predicate O such that the fact
O(a) occurs for each individual a in the ABox. In order to make rule (1) DL-safe,
we add the non-DL-atoms O(x) and O(y) to the rule body, obtaining:

$Homeworker(x) \leftarrow Person(x), livesAt(x, y), worksAt(x, y), O(x), O(y)$ (2)

In order to express a DL-safe rule intuitively, we just append to the original
rule the phrase: "where the identity of all objects is known". Rule (2) can be
intuitively expressed as follows: "A Homeworker is a known person who lives at and
works at the same known place".
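
The DL-safety condition of Definition 3 is purely syntactic, so it can be checked mechanically. The sketch below is one possible check applied to the two Homeworker rules; the atom encoding, the "?" prefix for variables and the function names are assumptions of this illustration and are not taken from any existing reasoner.

```python
def variables(atom):
    """Return the variables of an atom; variables are assumed to start with '?'."""
    _, args = atom
    return {a for a in args if a.startswith("?")}


def is_dl_safe(head, body, dl_predicates):
    """A rule is DL-safe if every variable occurs in a non-DL-atom of the body."""
    all_vars = set().union(*(variables(a) for a in head + body))
    guarded = set().union(
        *(variables(a) for a in body if a[0] not in dl_predicates)
    )
    return all_vars <= guarded


# Predicates that come from KB (DL-predicates); O and Homeworker are non-DL.
dl_predicates = {"Person", "livesAt", "worksAt"}

head = [("Homeworker", ("?x",))]
body_rule_1 = [
    ("Person", ("?x",)),
    ("livesAt", ("?x", "?y")),
    ("worksAt", ("?x", "?y")),
]
# Rule (2): the same body extended with the guard atoms O(x) and O(y).
body_rule_2 = body_rule_1 + [("O", ("?x",)), ("O", ("?y",))]

safe_1 = is_dl_safe(head, body_rule_1, dl_predicates)
safe_2 = is_dl_safe(head, body_rule_2, dl_predicates)
```

With this encoding, rule (1) is reported as unsafe because neither x nor y is guarded, while rule (2) passes the check.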

A combined knowledge base (KB, P) may be divided into an intensional part,
which contains knowledge independent of any specific instances, and an extensional 
part, which contains factual knowledge. 
3.2 Problem of onto-relational frequent pattern discovery

In this subsection we formally define the problem of frequent pattern discovery
from knowledge bases represented in DL-safe rules, as it is addressed in this
paper. An initial formulation of this problem was presented in (Jozefowska et al.
2005). This subsection specializes it. Let us start with an example of a combined
knowledge base (KB, P).

Example 1 (Example knowledge base (KB, P))

Given is a knowledge base (KB, P) describing bank services, presented in Table 2.
For clarity of presentation, non-DL-predicates are denoted with the prefix p_. This
knowledge base could not be represented in description logic or Datalog alone.
Definite Horn rules require all variables to be universally quantified, and therefore
it is impossible to assert the existence of unknown individuals. For example, it is
impossible to assert that each account must have an owner. Moreover, Horn rules
are unable to represent disjunctions in rule heads, and hence it is not possible
to model that the range of the role isOwnerOf is a disjunction of Account and
CreditCard. In description logic, in turn, it is not possible to define the "triangle"
relationship that is modelled by the rule defining p_familyAccount. □

The task addressed in this paper is frequent pattern discovery. The patterns
found in our approach have the form of conjunctive queries over the combined
knowledge base (KB, P). An answer set of a query contains individuals of a
user-specified reference concept C. We assume that the queries are positive, i.e. they do
not contain any negative literals. Moreover, we assume that the queries are DL-safe.
This means that all variables in a query are bound to instances explicitly occurring
in (KB, P), even if they are not returned as a part of the query answer. In this
context a query is defined as follows.

Definition 4 (Conjunctive DL-safe queries)

Let (KB, P) be a combined knowledge base in DL-safe rules with KB represented
in $\mathcal{SHIF}$. Let $\mathbf{x} = \{x_1, \ldots, x_m\}$ be a set of undistinguished variables (the variables
whose bindings are not a part of the answer) and $key$ be the only distinguished
variable (that is, the variable whose bindings are returned in the answer). A
conjunctive query $Q(key, \mathbf{x})$ over (KB, P) is a rule using a special predicate name (that
does not belong to the set of names occurring in (KB, P)) in the head, and whose
body is a finite conjunction of atoms of the form $B(t_1, \ldots, t_n)$, where $B$ is an n-ary
predicate (either from the KB component or from the disjunctive Datalog program
P) and $t_i$, $i = 1, \ldots, n$, is the distinguished variable $key$ or a variable from $\mathbf{x}$. A
conjunctive query $Q(key, \mathbf{x})$ is DL-safe if each variable occurring in a DL-atom also
occurs in a non-DL-atom in $Q(key, \mathbf{x})$.

The inference problems for conjunctive queries are defined as follows:

• Query answering: An answer to a query $Q(key, \mathbf{x})$ w.r.t. (KB, P) is an
assignment $\theta$ of an individual to the distinguished variable $key$ such that
$(KB, P) \models \exists \mathbf{x} : Q(key\theta, \mathbf{x})$.

• Query containment: A query $Q_2(key, \mathbf{x}_2)$ is contained in a query $Q_1(key, \mathbf{x}_1)$
w.r.t. (KB, P) if $(KB, P) \models \forall key : [\exists \mathbf{x}_2 : Q_2(key, \mathbf{x}_2) \rightarrow \exists \mathbf{x}_1 : Q_1(key, \mathbf{x}_1)]$. □
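
As a rough illustration of query answering for a DL-safe conjunctive query, the sketch below evaluates a query body against a set of ground facts by simple backtracking-style joins. It assumes that the ground facts entailed by (KB, P) have already been materialised by a reasoner. The fact list here is written by hand for the bank example, with individuals spelled a1 and cc1, and is not the output of any actual reasoning step. The function name and the "?" variable convention are likewise assumptions.

```python
def answers(query_body, facts, key="?key"):
    """Return the bindings of the distinguished variable `key` that satisfy the body."""
    bindings = [{}]
    for pred, args in query_body:
        extended = []
        for binding in bindings:
            for fpred, fargs in facts:
                if fpred != pred or len(fargs) != len(args):
                    continue
                candidate = dict(binding)
                # Variables must be bound consistently; constants must match.
                if all(
                    candidate.setdefault(a, f) == f if a.startswith("?") else a == f
                    for a, f in zip(args, fargs)
                ):
                    extended.append(candidate)
        bindings = extended
    return {b[key] for b in bindings}


# Hand-materialised ground facts (an assumption of this example).
individuals = ["Anna", "Marek", "Jan", "a1", "cc1", "account2", "m1"]
facts = [
    ("Client", ("Anna",)), ("Client", ("Marek",)), ("Client", ("Jan",)),
    ("isOwnerOf", ("Anna", "a1")), ("isOwnerOf", ("Marek", "a1")),
    ("isOwnerOf", ("Jan", "cc1")),
    ("Account", ("a1",)), ("Account", ("account2",)),
    ("CreditCard", ("cc1",)),
] + [("O", (i,)) for i in individuals]

# Q(key) = ?- Client(key), isOwnerOf(key, x), Account(x), O(key), O(x)
query = [
    ("Client", ("?key",)),
    ("isOwnerOf", ("?key", "?x")),
    ("Account", ("?x",)),
    ("O", ("?key",)),
    ("O", ("?x",)),
]
result = answers(query, facts)
```

Because the query is DL-safe, every variable is bound to a named individual through the O atoms, which is what makes evaluation over a finite set of ground facts meaningful.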
Terminology in KB

$Client \equiv \exists isOwnerOf$                                 A client is defined as an owner of something.
$\top \sqsubseteq \forall isOwnerOf.(Account \sqcup CreditCard)$  The range of isOwnerOf is a disjunction of
                                                                  Account and CreditCard.
$\exists isOwnerOf^{-} \sqsubseteq Property$                      Having an owner means being a property.
$Gold \sqsubseteq CreditCard$                                     Gold is a subclass of CreditCard.
$relative \equiv relative^{-}$                                    The role relative is symmetric.
$Account \sqsubseteq \exists isOwnerOf^{-}$                       Each account has an owner.
$\top \sqsubseteq \forall hasMortgage.Mortgage$                   The range of hasMortgage is Mortgage.
$\top \sqsubseteq \forall hasMortgage^{-}.Account$                The domain of hasMortgage is Account.
$\top \sqsubseteq\ \leq 1\, hasMortgage^{-}$                      A mortgage can be associated with at most one
                                                                  account.
$Account \equiv \neg CreditCard$                                  Account is disjoint with CreditCard.

Rules in P

$p\_familyAccount(x, y, z) \leftarrow Account(x),$                p_familyAccount is an account that is
  $isOwner(y, x), isOwner(z, x),$                                 co-owned by relatives.
  $relative(y, z), O(x), O(y), O(z)$
$p\_sharedAccount(x, y, z) \leftarrow$                            Family account is a shared account.
  $p\_familyAccount(x, y, z)$
$p\_man(x) \vee p\_woman(x) \leftarrow Client(x), O(x)$           A client is a man or a woman.

Assertions in (KB, P)

$p\_woman(Anna)$                                                  Anna is a woman.
$isOwnerOf(Anna, a1)$                                             Anna is an owner of a1.
$hasMortgage(a1, m1)$                                             Mortgage m1 is associated to account a1.
$relative(Anna, Marek)$                                           Anna is a relative of Marek.
$isOwnerOf(Jan, cc1)$                                             Jan is an owner of cc1.
$CreditCard(cc1)$                                                 cc1 is a credit card.
$isOwnerOf(Marek, a1)$                                            Marek is an owner of a1.
$Account(account2)$                                               account2 is an account.
$O(i)$ for each explicitly named individual $i$                   Enumeration of all ABox individuals.

Table 2: An example of a combined knowledge base.
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In our approach patterns are positive (i.e., without negative Hterals) conjunc- 
tive DL-safe queries over the combined knowledge base {KB, P) addressing a user- 
specified reference concept C. The atom with a reference concept as the predicate 
contains the only distinguished variable key. 

Definition 5 (Pattern)

Given is a combined knowledge base (KB, P). A pattern Q is a conjunctive, positive
DL-safe query over (KB, P) of the following form:

Q(key) = ?- C(key), B_1, ..., B_n, O(key), O(x_1), ..., O(x_m)

where B_1, ..., B_n represent atoms of the query, Q(key) denotes that variable key
is the only distinguished variable, and x_1, ..., x_m represent the undistinguished
variables of the query. Q(key) is called the head of Q, denoted head(Q), and
the conjunction C(key), B_1, ..., B_n, O(key), O(x_1), ..., O(x_m) is called the body
of Q, denoted body(Q). A trivial pattern is the query of the form:
Q(key) = ?- C(key), O(key). □

We assume each query possesses the linkedness property, that is, each variable in
the body of a query is linked to the variable key through a path of atoms.

Definition 6 (Linkedness)

A variable x is linked in a query Q iff x occurs in the head of Q or there is an atom
B in the body of Q that contains the variable x and a variable y (different from
x), and y is linked. □
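Because the definition of linkedness is recursive, it can be evaluated as a fixpoint that starts from the head variables. The sketch below illustrates this under the assumption that atoms are encoded as (predicate, variables) pairs; the function names are hypothetical.

```python
def linked_variables(head_vars, body):
    """Compute the set of linked variables as a least fixpoint."""
    linked = set(head_vars)
    changed = True
    while changed:
        changed = False
        for _pred, args in body:
            # An atom that touches a linked variable links all its other variables.
            if any(v in linked for v in args):
                new = set(args) - linked
                if new:
                    linked |= new
                    changed = True
    return linked


def is_linked(head_vars, body):
    """A query is linked if every body variable is linked."""
    all_vars = {v for _pred, args in body for v in args}
    return all_vars <= linked_variables(head_vars, body)


linked_query = [("Client", ("key",)), ("isOwnerOf", ("key", "x")),
                ("hasMortgage", ("x", "y"))]
unlinked_query = [("Client", ("key",)), ("hasMortgage", ("x", "y"))]

print(is_linked(["key"], linked_query))
print(is_linked(["key"], unlinked_query))
```

In the first query y is reached from key through x; in the second query neither x nor y shares an atom with a linked variable.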

Examples of patterns that can be discovered from the knowledge base introduced
in Example 1 are presented below.

Example 2 (Example patterns)

Consider the knowledge base (KB, P) from Example 1. Assuming that Client is the
reference concept C, the following patterns over (KB, P) may be built:

Qref(key) = ?- Client(key), O(key)

Q1(key) = ?- Client(key), isOwnerOf(key, x), O(key), O(x)

Q2(key) = ?- Client(key), isOwnerOf(key, x), p_familyAccount(x, key, z)

Q3(key) = ?- Client(key), isOwnerOf(key, x), isOwnerOf(key, y), O(key), O(x), O(y)

Q4(key) = ?- Client(key), isOwnerOf(key, x), CreditCard(x), O(key), O(x)

where Qref is a reference query, counting the number of instances of C. □

In order to define the task of frequent pattern discovery we need to define how 
to calculate the pattern support. 

Definition 7 (Support)

Let Q be a query over a combined knowledge base (KB, P), let answerset(C, Q, (KB, P))
be a function that returns the set of all instances of concept C that satisfy query Q
with respect to (KB, P), and let Qref denote a trivial query for which the answer set
contains all instances of the reference concept C in (KB, P).

The support of query Q with respect to the knowledge base (KB, P) is defined as the
ratio between the number of instances of the reference concept C that satisfy query
Q w.r.t. (KB, P) and the total number of instances of the reference concept C:

\[
\mathit{support}(C, Q, (KB, P)) =
  \frac{|\mathit{answerset}(C, Q, (KB, P))|}{|\mathit{answerset}(C, Q_{ref}, (KB, P))|}
\]
□

The support is calculated as the ratio of the number of bindings of variable key in
the given query Q to the number of bindings of variable key in the reference query
Qref. The reference concept C determines what is counted. Let us now calculate
the support of query Q2 from Example 2.

Example 3

To illustrate the notion of support, consider the queries from Example 2.
The reference query has 3 items in its answer set, that is, 3 individuals from (KB, P)
that are deduced to be Client due to the axiom defining a client as an owner of
something. Query Q2, for example, has 2 items in its answer set, that is, the clients
that are co-owners of at least one account with their relatives (Anna, Marek). The
support of query Q2 is then calculated as: support(C, Q2, (KB, P)) = 2/3 ≈ 0.66. □

Finally, we can formulate our task of frequent pattern discovery in a combined
knowledge base (KB, P).

Definition 8 (Frequent pattern discovery)
Given

• a combined knowledge base (KB, P) represented in DL-safe rules, where KB
  is represented in SHIF and P is a positive disjunctive Datalog program,

• a set of patterns in the form of queries Q that all contain a reference concept
  C as a predicate in one of the atoms in the body and where the variable in
  the atom C is the only distinguished variable,

• a minimum support threshold minsup specified by the user,

and assuming that queries with support s are frequent in (KB, P) if s ≥ minsup,
the task of frequent pattern discovery is to find the set of frequent queries. □

Example 4

Let us assume the threshold minsup = 0.5 and let us consider the queries from
Example 2. The set of frequent patterns is then {Qref, Q1, Q2, Q3}. □
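The arithmetic of Examples 3 and 4 can be reproduced in a few lines. This is only a sketch: the answer sets are typed in by hand (Anna and Marek are named in Example 3, and Jan is the third owner listed in Table 2), whereas in the method itself they would be returned by the reasoner. The function name `support` mirrors Definition 7.

```python
from fractions import Fraction


def support(answers_q, answers_ref):
    """Support as the ratio of answer-set sizes (Definition 7)."""
    answers_ref = set(answers_ref)
    if not answers_ref:
        raise ValueError("the reference concept has no instances")
    return Fraction(len(set(answers_q)), len(answers_ref))


# Hand-entered answer sets; a reasoner would normally compute them.
answers_ref = {"Anna", "Marek", "Jan"}   # instances of Client
answers_q2 = {"Anna", "Marek"}           # clients satisfying Q2

minsup = Fraction(1, 2)
s = support(answers_q2, answers_ref)
is_frequent = s >= minsup

print(s, float(s), is_frequent)
```

Exact fractions are used so that a comparison with the threshold is not affected by floating-point rounding.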

4 Solution algorithm 

The main contribution of this paper is the algorithm for frequent pattern discovery
in combined knowledge bases represented in DL-safe rules as described in Section
3. Initial results on the algorithm development have been presented in (Jozefowska
et al. 2006; Jozefowska et al. 2008). This section advances them. Our method follows
the usual approach where the search starts with the most general patterns and
refines them to more specific ones in consecutive steps. Thus, we first define
the generality relation and then the refinement operator that computes a set of
specializations of a pattern.

4.1 Generality relation

We use a semantic generality relation in order to fully utilize the information stored
in the combined knowledge base (KB, P). As we have defined in Section 3, patterns
are represented as queries, so it seems natural to define the generality relation as
the query containment (or subsumption) relation.

Definition 9 (Generality relation)

Given two patterns Q1 and Q2 defined as queries over a combined knowledge base
(KB, P) (see Definition 5) we say that pattern Q1 is at least as general as pattern
Q2 under query containment w.r.t. (KB, P), Q1 ⪰_B Q2, iff query Q2 is contained
in query Q1 w.r.t. (KB, P). □

Theorem 1 lays the foundations for an algorithm to test pattern subsumption.

Theorem 1 (Testing ⪰_B)

Let Q1(key, x1) and Q2(key, x2) be two queries and (KB, P) be a
combined knowledge base. Let θ be a substitution grounding the variables in Q2
using new constants not occurring in (KB, P) (Skolem substitution). Then Q1 ⪰_B Q2
if and only if there exists a ground substitution σ for Q1 such that

(i) head(Q2)θ = head(Q1)σ and

(ii) (KB, P) ∪ body(Q2)θ |= body(Q1)σ

Proof

(⇐) Assume there exists a ground substitution σ for Q1 such that (i) and (ii) hold.
Let a be some individual, and let I_a be some interpretation that is a model
of (KB, P) such that a is an answer to the query Q2 in I_a. In order to prove
that Q1 ⪰_B Q2 we need to prove that a is also an answer to the query Q1 in
I_a. By the definition of query answering (Definition 4) there exists a substitution φ
such that a is identical to keyφ and ∃body(Q2)φ is true in I_a. Since Q2 is DL-safe
there must exist another substitution, φ', such that Q2φ' is ground, a is identical
to keyφ' and body(Q2)φ' is true in I_a. Because (KB, P) ∪ body(Q2)θ |=
body(Q1)σ is valid, by the uniform replacement of constants we have head(Q2)φ' =
head(Q1)σ and (KB, P) ∪ body(Q2)φ' |= body(Q1)σ, so body(Q1)σ is also true in I_a.
Because head(Q2)φ' is identical to head(Q1)σ, this implies that a = keyφ' is an answer to
the query Q1. This argument holds for any interpretation I satisfying the initial
constraints, so Q1 ⪰_B Q2.

(⇒) Assume Q1 ⪰_B Q2. The following arguments show that a ground substitution
σ exists. Let a substitution θ be given as in the theorem. Let I be a model of
(KB, P) ∪ Q2θ. Since keyθ is an answer to Q2 in I, keyθ is also an answer to Q1 in
I. Moreover, since Q1 is DL-safe, there must exist a ground substitution φ such that
head(Q2)θ = head(Q1)φ, and body(Q1)φ is true in I. By the uniform replacement
of constants we obtain that head(Q2)θ = head(Q1)σ, and body(Q1)σ is true in I.
This argumentation is valid for any interpretation satisfying the constraints, so the
thesis follows. □
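To make the test of Theorem 1 concrete, the following sketch covers only the degenerate case of an empty knowledge base, which is an assumption made here for brevity. In that case condition (ii) reduces to checking that every atom of body(Q1)σ literally occurs in body(Q2)θ, so the test becomes a search for σ. With a non-empty (KB, P) the membership check would have to be replaced by a call to a reasoner. The function names and the `sk_` prefix for Skolem constants are hypothetical.

```python
def skolemize(body):
    """The substitution theta: ground each variable v with a fresh constant 'sk_v'."""
    return {(pred, tuple("sk_" + v for v in args)) for pred, args in body}


def at_least_as_general(q1_body, q2_body, key="key"):
    """Test Q1 >=_B Q2 for an EMPTY knowledge base (entailment = membership)."""
    facts = skolemize(q2_body)

    def extend(atoms, sigma):
        if not atoms:
            return True
        (pred, args), rest = atoms[0], atoms[1:]
        for fact_pred, fact_args in facts:
            if fact_pred != pred or len(fact_args) != len(args):
                continue
            candidate = dict(sigma)
            # Bind each variable, rejecting the fact on any conflicting binding.
            if all(candidate.setdefault(v, c) == c for v, c in zip(args, fact_args)):
                if extend(rest, candidate):
                    return True
        return False

    # Condition (i): both heads agree, so key must map to theta(key).
    return extend(list(q1_body), {key: "sk_" + key})


q1 = [("Client", ("key",)), ("isOwnerOf", ("key", "x"))]
q3 = [("Client", ("key",)), ("isOwnerOf", ("key", "x")), ("isOwnerOf", ("key", "y"))]
q4 = [("Client", ("key",)), ("isOwnerOf", ("key", "x")), ("CreditCard", ("x",))]

print(at_least_as_general(q1, q4))  # Q1 is more general than Q4
print(at_least_as_general(q4, q1))
print(at_least_as_general(q3, q1))  # x and y may both map to the same constant
```

The last call shows that Q3 and Q1 contain each other even without any background knowledge, because σ may map both x and y to the same Skolem constant.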

Below we prove that appending an atom to a query results in an equally or more
specific query, which gives an easy way to build specializations of a query.

Proposition 1

Let Q2 be a query over (KB, P), built from query Q1 by adding an atom. It holds
that Q1 ⪰_B Q2.

Proof

Let us consider query Q1 = ?- C(key), B_1, ..., B_n and let us add atom B_{n+1} to Q1,
obtaining query Q2 = ?- C(key), B_1, ..., B_n, B_{n+1}. Let θ be an answer to query
Q2. According to Definition 4, (KB, P) |= ∃x : Q2(keyθ, x). But since a query is a
conjunction of atoms, it follows that also (KB, P) |= ∃x : Q1(keyθ, x). Thus, the
answer set of query Q2 is a subset of the answer set of query Q1, which completes
the proof. □

A crucial property of the generality relation, which allows efficient algorithms to be
developed, is monotonicity with regard to support.

Proposition 2

Let Q1 and Q2 be two queries over the combined knowledge base (KB, P) that
both contain the reference concept C. If Q1 ⪰_B Q2 then support(C, Q1, (KB, P)) ≥
support(C, Q2, (KB, P)).

Proof

If Q1 ⪰_B Q2 then, by Definition 9, query Q2 is contained in query Q1. Further, from
Definition 4 we conclude that since query Q2 is contained in query Q1 then for any
possible extensional part of (KB, P), while keeping the same intensional part, the
answer set of Q2 is contained in the answer set of Q1, and in consequence, by
Definition 7, support(C, Q1, (KB, P)) ≥ support(C, Q2, (KB, P)), which completes
the proof. □

The monotonicity of the query containment with regard to the query support
means that none of the specializations of an infrequent pattern can be frequent.
The generality relation is a reflexive and transitive binary relation, and so it
is a quasi-order on the space of patterns. It is known (Nienhuys-Cheng and de Wolf
1997) that any quasi-ordered space may be searched using refinement operators. In
the next section we define the refinement operator used in our algorithm.

4.2 Refinement operator

We define a downward refinement operator that computes a set of specializations of
a query. This set is obtained using both syntax and semantics of the query. Firstly,
a query is appended with a single atom according to the rules given in Definition
10. In the second step semantic tests are performed which may exclude further
patterns from consideration.

It is convenient to represent the results of the refinement steps on a special trie
structure that was introduced in the FARMER method (Nijssen and Kok 2001;
Nijssen and Kok 2003). A trie is a tree with nodes corresponding to the atoms of
the query, so that each path from the root to any node corresponds to a query. In
consequence every node in a trie defines a query. According to Propositions 1 and
2 only nodes which correspond to frequent queries need to be expanded further.

[Figure: trie rooted at Client(key); superscripts 1 and 2 on the edges indicate the
refinement rule that added each atom.]

Fig. 1: A part of the trie constructed for the (KB, P) from Example 1, C = Client,
minsup = 0.2.

An example of the trie data structure for a data mining problem defined over
the knowledge base from Example 1 is presented in Figure 1. In order to build the
patterns the following predicates from the knowledge base were selected: Client,
isOwnerOf, relative, hasMortgage and p_woman. Notice that the special purpose
predicates (the ones of the form O(x)) are omitted from the presentation in the trie.
The presence of such predicates indicates that a query is DL-safe. As we assume
that all queries within our approach are DL-safe, we can omit the special purpose
predicates for simplicity. The superscripts in Figure 1 correspond to the two ways
described in Definition 10 in which atoms are added to the query.



Definition 10

Let T be a trie data structure that imposes an order of atoms in a query. Let Q be
a query, let B be last(Q), that is, the last atom in query Q, and let B_p be the parent of
B in T. A variable is called new if it does not occur in any earlier atom of a query.
Atoms are added to trie T as:

1. dependent atoms (share at least one variable with last(Q), that was new in
   last(Q)),

2. right brothers of a given node in T (these are the copies of atoms that have
   the same parent B_p as the given atom B and are placed on the right-hand
   side of B in B_p's child list); new variables are renamed such that they are
   also new in the copy. □

The first rule introduces the dependent atoms that could not be added earlier.
The dependent atoms are brothers of each other in the trie. The second rule, the
right brother copying mechanism, ensures that all possible subsets, but only one
permutation, of the set of dependent atoms are considered.
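The role of the trie can be illustrated with a small data structure in which a node stores one atom and the query is read off the path from the root. This is a sketch under stated assumptions: the class `TrieNode` and its methods are hypothetical names, variable renaming for copied atoms is left out, and the nodes built below are illustrative rather than a transcription of Figure 1.

```python
class TrieNode:
    """One atom of a query; the path from the root to this node is the query."""

    def __init__(self, atom, parent=None):
        self.atom = atom
        self.parent = parent
        self.children = []

    def add_child(self, atom):
        child = TrieNode(atom, parent=self)
        self.children.append(child)
        return child

    def query(self):
        """Collect the atoms on the path from the root down to this node."""
        atoms = []
        node = self
        while node is not None:
            atoms.append(node.atom)
            node = node.parent
        return list(reversed(atoms))

    def right_brothers(self):
        """Siblings placed to the right of this node in the parent's child list."""
        if self.parent is None:
            return []
        siblings = self.parent.children
        return siblings[siblings.index(self) + 1:]


root = TrieNode(("Client", ("key",)))
owner = root.add_child(("isOwnerOf", ("key", "x1")))
woman = root.add_child(("p_woman", ("key",)))
mortgage = owner.add_child(("hasMortgage", ("x1", "x2")))

print(mortgage.query())
print([node.atom for node in owner.right_brothers()])
```

Under the second rule of Definition 10, the atoms returned by `right_brothers` are the candidates to be copied below a node, which is how each subset of dependent atoms is reached along exactly one path.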

Let us now introduce the semantic tests which are performed as the second step
of the refinement procedure in order to reduce the set of patterns submitted for
frequency evaluation. For efficiency reasons, semantic tests are performed
on a knowledge base (KB, P) with ground facts like concept and role assertions
removed. This reduced knowledge base is denoted by cp(KB, P). The first test
consists in determining the query satisfiability; further ones check for some kinds
of semantic redundancy, as described later in this section.

The test for checking satisfiability of query Q(key, x) with regard to knowledge
base cp(KB, P) consists in checking whether cp(KB, P) ∪ {∃key, x : Q} is satisfiable,
that is, whether there is a model of cp(KB, P) in which there is some valuation for
the distinguished variable key and undistinguished variables x. The variables are
skolemized, and assuming that a and b are new constants, Q(a, b) is asserted to
cp(KB, P). Then it is checked whether the updated cp(KB, P) is satisfiable. The
query satisfiability test described above is defined in Definition 11 below.

Definition 11

Query Q is satisfiable w.r.t. a combined knowledge base (KB, P) iff (KB, P) ∪ Qθ
is satisfiable, where θ is a Skolem substitution. □

Example 5

Let us consider the knowledge base (KB, P) from Example 1 and the query:

Q(key) = ?- Account(key), CreditCard(key), O(key)

Since in (KB, P) the concepts Account and CreditCard are specified as disjoint,
we know a priori that it is useless to submit the query Q as it cannot have any
answer due to its unsatisfiability. □
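The clash in Example 5 can be detected by a much weaker check than full satisfiability testing. The sketch below looks only for two explicitly disjoint concepts asserted on the same variable. It is an illustrative simplification and not the test of Definition 11, which requires a reasoner because unsatisfiability can also follow from axioms interacting indirectly. The function name and the list of disjoint pairs are assumptions.

```python
def violates_disjointness(body, disjoint_pairs):
    """Detect two explicitly disjoint concepts asserted on the same variable."""
    concepts_of = {}
    for pred, args in body:
        if len(args) == 1:  # unary atoms are concept atoms
            concepts_of.setdefault(args[0], set()).add(pred)
    return any(a in concepts and b in concepts
               for concepts in concepts_of.values()
               for a, b in disjoint_pairs)


# Disjointness axiom taken from Table 2.
DISJOINT = [("Account", "CreditCard")]

q_bad = [("Account", ("key",)), ("CreditCard", ("key",)), ("O", ("key",))]
q_ok = [("Client", ("key",)), ("isOwnerOf", ("key", "x")), ("CreditCard", ("x",))]

print(violates_disjointness(q_bad, DISJOINT))
print(violates_disjointness(q_ok, DISJOINT))
```

A candidate rejected by this check never needs to be evaluated against the assertions, which is the source of the savings discussed in this section.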

After performing the satisfiability test, the queries are further pruned in order to
obtain only those candidates that are not semantically redundant. We consider two
kinds of semantic redundancy. The first kind occurs when a query has redundant
atoms, that is, atoms that can be deduced from other atoms in the query. The second
kind occurs when there are frequent queries already found in the earlier steps that
are semantically equivalent to the newly generated candidate.

In order to avoid the first kind of redundancy, the queries are tested for semantic
freeness. Only semantically free queries are kept for further processing. The notion
of semantic freeness was introduced in (de Raedt and Ramon 2004). It is
adapted to our setting as follows.

Definition 12 (Semantically free pattern)

A pattern Q is semantically free or s-free w.r.t. a combined knowledge base (KB, P)
if there is no pattern Q', built from Q by removing any atom, such that Q ⪰_B Q'. □



Example 6

Given is the knowledge base from Example 1 and the following queries to this
knowledge base:

Q1(key) = ?- Account(key), isOwnerOf(x, key), O(key), O(x)

Q2(key) = ?- Account(key), isOwnerOf(x, key), Client(x), O(key), O(x)

Query Q1 is s-free while query Q2 is not. The reason why the second query is not
s-free is that atom Client(x) can be deduced from the other atoms of this query. More
specifically, the atom Client(x) can be deduced from the atom isOwnerOf(x, key),
as from the axioms in the knowledge base it follows that any object being asserted
to the domain of isOwnerOf is a Client. □
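Example 6 can be replayed with a toy approximation of the s-freeness test. The sketch assumes that the background knowledge is given as single-premise Horn rules with distinct variables, here just one hand-written consequence of the Client axiom, so that entailment can be decided by saturating the Skolemized atoms. The O atoms are omitted, and all function names are hypothetical. The method described in this paper delegates this reasoning to an external reasoner instead.

```python
def freeze(body):
    """Skolem substitution: replace each variable v by the constant 'sk_v'."""
    return {(p, tuple("sk_" + v for v in args)) for p, args in body}


def saturate(facts, rules):
    """Apply single-premise Horn rules until no new ground fact appears."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for (body_pred, body_args), (head_pred, head_args) in rules:
            for fact_pred, fact_args in list(facts):
                if fact_pred == body_pred and len(fact_args) == len(body_args):
                    binding = dict(zip(body_args, fact_args))
                    derived = (head_pred, tuple(binding[v] for v in head_args))
                    if derived not in facts:
                        facts.add(derived)
                        changed = True
    return facts


def maps_into(atoms, facts, sigma):
    """Search for a substitution sending every atom onto some fact."""
    if not atoms:
        return True
    (pred, args), rest = atoms[0], atoms[1:]
    for fact_pred, fact_args in facts:
        if fact_pred == pred and len(fact_args) == len(args):
            candidate = dict(sigma)
            if all(candidate.setdefault(v, c) == c for v, c in zip(args, fact_args)):
                if maps_into(rest, facts, candidate):
                    return True
    return False


def is_s_free(body, rules, key="key"):
    """No atom may be removable without changing the meaning of the query."""
    for i in range(len(body)):
        reduced = body[:i] + body[i + 1:]
        facts = saturate(freeze(reduced), rules)
        if maps_into(list(body), facts, {key: "sk_" + key}):
            return False
    return True


# Hand-written consequence of "Client is an owner of something" (Table 2).
RULES = [(("isOwnerOf", ("a", "b")), ("Client", ("a",)))]

q1 = [("Account", ("key",)), ("isOwnerOf", ("x", "key"))]
q2 = q1 + [("Client", ("x",))]

print(is_s_free(q1, RULES))
print(is_s_free(q2, RULES))
```

For Q2 the loop finds that dropping Client(x) leaves a query from which Client(x) is still derivable, so Q2 is reported as not s-free.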

Moreover, the test for semantic freeness is performed on a query with the atom
C(key) removed. The motivation is that the reference concept is obligatory in each
query, so the plain s-freeness test could prune queries that we would prefer to keep
merely because the atom C(key) is deducible from their other atoms.

Example 7

Consider the knowledge base from Example 1 and Client as a reference concept.
Then query:

Q(key) = ?- Client(key), isOwnerOf(key, x), O(key), O(x)

does not pass the s-freeness test from Definition 12 because the atom Client(key)
can be deduced from the second atom of Q. However, the atom Client(key) contains
the reference concept, which is obligatory in each query. Consider now query Q',
obtained by removing the atom with a reference concept from query Q:

Q'(key) = ?- isOwnerOf(key, x), O(key), O(x)

The modified query, Q', is s-free. Reconsider the queries from Example 2. Queries
Qref, Q1, Q2 and Q4 are s-free with regard to the modified s-freeness test, while
query Q3 is not s-free. □

A candidate query may be semantically redundant not only due to redundant
atoms. The second kind of redundancy occurs when a candidate query is semantically
equivalent to a frequent one already found. Such patterns are also pruned,
which is performed by searching the trie for a pattern equivalent to the given
one. A so-called optimal refinement operator ensures that no pattern is generated
twice. Using the trie data structure and pruning the candidate patterns that are
semantically equivalent to the ones already found makes our refinement operator
optimal. By pruning semantically equivalent patterns we also achieve the property
of properness of the refinement operator, that is, every pattern Q' generated by the
refinement operator is more specific than the pattern Q being refined (Q' is never
equivalent to Q).

4.3 The algorithm

The approach proposed in this paper follows the common scheme of algorithms for
finding frequent patterns, which is a "generate-and-test" approach. In such an approach
candidate queries are repeatedly generated and tested for their frequency. In order
to generate candidates, a refinement operator is applied.

The proposed recursive node expansion algorithm is presented below. A node
being expanded is denoted by n_d, Q(key, x) denotes a query with x being undistinguished
variables, and d denotes the depth of the current node in the trie T. The trie
is generated up to the user-specified depth MAXDEPTH.

Algorithm 1

expandNode(n_d, Q(key, x), d, T, MAXDEPTH)

1.  if d < MAXDEPTH then
2.    while all possible children of n_d not constructed do
3.      construct child node n_{d+1} and associated query Q_c(key, x) using the trie
        data structure T and refinement rules from Definition 10
4.      if Q_c(key, x) is satisfiable w.r.t. (KB, P) then
5.        if Q_c(key, x) is semantically free w.r.t. (KB, P) then
6.          if Q_c(key, x) is not semantically equivalent w.r.t. (KB, P) to any
            frequent query found earlier then
7.            evaluate candidate query Q_c(key, x)
8.            if Q_c(key, x) is frequent then
9.              addChild(n_d, n_{d+1}); // add n_{d+1} as a child of n_d
10.             T ← T ∪ n_{d+1};
11.   for all children n_{d+1} of node n_d do
12.     expandNode(n_{d+1}, Q_c(key, x), d + 1, T, MAXDEPTH)
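The control flow of Algorithm 1 can be sketched as follows. Everything that requires reasoning is passed in as a callback, and the demonstration replaces the knowledge base with a toy propositional stand-in in which each client simply has a set of features, loosely modelled on Table 2. The names `expand_node`, `ops`, `FEATURES` and `VOCAB` are assumptions made for illustration, and this is not the SEMINTEC implementation.

```python
def expand_node(query, depth, max_depth, ops, minsup, frequent):
    """Recursive expansion; comments refer to the line numbers of Algorithm 1."""
    if depth >= max_depth:                                      # line 1
        return
    children = []
    for cand in ops["refine"](query):                           # lines 2-3
        if not ops["satisfiable"](cand):                        # line 4
            continue
        if not ops["s_free"](cand):                             # line 5
            continue
        if any(ops["equivalent"](cand, f) for f in frequent):   # line 6
            continue
        if ops["support"](cand) >= minsup:                      # lines 7-8
            children.append(cand)                               # lines 9-10
            frequent.append(cand)
    for child in children:                                      # lines 11-12
        expand_node(child, depth + 1, max_depth, ops, minsup, frequent)


# Toy stand-in for the reasoner: each client is described by a set of features.
FEATURES = {"Anna": {"isOwnerOf", "relative", "p_woman"},
            "Marek": {"isOwnerOf", "relative"},
            "Jan": {"isOwnerOf"}}
VOCAB = ["isOwnerOf", "relative", "p_woman"]


def refine(query):
    """Append one atom that comes later in VOCAB than the last atom of the query."""
    start = VOCAB.index(query[-1]) + 1 if query[-1] in VOCAB else 0
    return [query + (atom,) for atom in VOCAB[start:]]


def toy_support(query):
    wanted = set(query[1:])  # skip the reference concept
    return sum(wanted <= feats for feats in FEATURES.values()) / len(FEATURES)


ops = {"refine": refine,
       "satisfiable": lambda q: True,
       "s_free": lambda q: True,
       "equivalent": lambda a, b: set(a) == set(b),
       "support": toy_support}

root = ("Client",)
frequent = [root]
expand_node(root, 1, 3, ops, 0.5, frequent)
for pattern in frequent:
    print(pattern)
```

Infrequent candidates are never added to `children`, so none of their specializations are generated, which is the pruning justified by Proposition 2.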



Completeness of search. Below we prove the completeness of our method for pattern
refinement, that is, we prove that the proposed approach to pattern mining generates
for each pattern Q from the space of valid patterns a valid pattern Q' such that Q'
is semantically equivalent to Q. Valid patterns are those, from the ones defined in
Definition 5, that are linked and semantically free. In order to prove completeness,
we relate to the work on FARMER (Nijssen and Kok 2003), which originally used
the trie data structure for relational frequent pattern mining.

First we prove that pruning semantically equivalent patterns (after s-freeness test
or after the search on the trie) does not exclude adding all possible refinements to
a pattern.

Lemma 1

Let (KB, P) be a combined knowledge base, and Q1, Q2 be two semantically equivalent
patterns (Q1 ≡_B Q2) over (KB, P). Then for each variable x in Q1 there exists
a corresponding variable x' in Q2 to which the same bindings can be made as to
the variable x.



Proof

By definition (Definition 5) both patterns have the same distinguished variable key,
so the thesis follows for x = key. Let us now provide the following argumentation for
x being an undistinguished variable. Since Q1 ≡_B Q2 then also Q1 ⪰_B Q2. Suppose
θ is a Skolem substitution grounding variables in Q2 that satisfies the constraints
from Theorem 1. By definition, the substitution θ assigns a new individual a to
variable key. The individual a is an answer to Q2, and since Q1 ⪰_B Q2, a is also an
answer to Q1. For Q1 ⪰_B Q2 to be valid there must exist a grounding substitution σ
for Q1 that satisfies the constraints from Theorem 1. Since Q2 is linked, that is, all of
its variables are linked to the variable key, then also all the constants introduced by
θ are linked to the individual a. Since Q1 is linked, then all the constants that bind
to variables of Q1 to prove the answer a have to be linked to a as well. Since a and
all the constants introduced to the (KB, P) by θ are new, then any other constants
in the (KB, P) are not linked to a. In consequence, only the constants introduced
to the (KB, P) by θ can be a part of the substitution σ. Then for each variable
x in Q1 there must exist a constant b that is assigned to x by the substitution σ
and has been introduced by θ. That is why there must exist a variable x' in Q2
for which θ introduces b, and consequently the same bindings that can be made to
variable x in Q1 can as well be made to the corresponding variable x' in Q2. This
argumentation is valid for any variables x and x', which completes the proof. □

The following corollary is a consequence of Lemma 1.

Corollary 1

Let (KB, P) be a combined knowledge base, and Q1, Q2 be two semantically equivalent
patterns (Q1 ≡_B Q2) over (KB, P). Then for each variable x in Q1 there exists
a corresponding variable x' in Q2 such that any atom B that can be linked to Q1
through the variable x can be also linked to Q2 through the variable x'.

Subsequently we prove that all possible refinements of a pattern are generated.

Lemma 2

Given is a trie T, recursively generated by Algorithm 1, a query Q which occurs in
T, and an atom B ∉ Q which is a valid refinement of Q. Then either:

(i) valid query Q' = (Q1, B, Q2) exists in trie T, for some subdivision of Q into
    Q1 and Q2, such that Q = (Q1, Q2) or

(ii) valid query Q'' exists in trie T, such that query Q'' is semantically equivalent
     to query Q'.

Proof

Consider case (i). As B is a valid refinement of Q, there is a prefix (Q_p, B_p) of Q
such that atom B is a dependent atom of B_p. If B_p is the last atom of Q, then it
is clear that B, as a dependent atom of B_p, is generated as a refinement of Q to
be added at the end of the query. Dependent atom B is generated by the first rule
from Definition 10 and checked for its validity (satisfiability and s-freeness). Hence,
query Q' is generated. Let us assume now that B_p is not the last atom and it has
a different successor B_{p+1} in query Q. Atom B_{p+1} is also a child of B_p in T. Then
let us consider the order of B and B_{p+1} in the list of children of B_p in trie T, which
is one of the following:

• B occurs before B_{p+1}; then B_{p+1} is a right-hand brother of B. The right brother
  copying mechanism, the second rule from Definition 10, will copy B_{p+1} as a child
  of B; the same operations that created Q will create query Q' in subsequent steps.

• B occurs after B_{p+1}; B is copied as a child of B_{p+1}. In order to determine the exact
  injection place of B, we recursively apply our arguments, taking into account B_{p+1}
  and B.

It follows from the above arguments that query Q' is always generated. After generation
of query Q', it is checked, in line 6 of Algorithm 1, if query Q' is semantically
equivalent to some query Q'' already present in the trie T. If this is the case, Q'' is
kept in T, and Q' is not added to T. Otherwise, the newly generated query Q' is
added to the trie T. Thus, either query Q' exists in the trie T or it is semantically
equivalent to query Q''. This completes the proof. □

Finally we prove the completeness. 

Theorem 2 (Completeness)

For every valid, frequent query Q1 in the pattern space, there is a semantically
equivalent valid query Q2 in the trie T.

Proof

Let us assume that queries are generated up to the user-specified length (MAXDEPTH).
For query Q1 of length 1 it is obvious that there is a corresponding query Q2 of the
form Q(key) = ?- C(key), O(key) in the root of the trie (atoms of the form O(x)
are not taken into account as described earlier). For query Q1 of length > 1, the
proof is by induction on the length of the query. Assume that an equivalent query
for Q1 \ last(Q1) exists in trie T. From Corollary 1 it follows that any refinement that
can be made to Q1 \ last(Q1) can be also made to any of its equivalent queries.
If atom last(Q1) is a valid refinement of the equivalent query, Lemma 2 applies.
Hence, the thesis follows by induction. □



4.4 Implementation 

The proposed method employs several reasoning services run over a combined
knowledge base (KB, P) such as: (conjunctive) query answering, deciding knowledge
base satisfiability, deciding concept subsumption, classifying the concept hierarchy.
In order to perform all these reasoning services, specialized and complex algorithms
are needed. As the implementation of such reasoning services is out of the scope of
this work, to test our ideas we decided to use an external reasoner, KAON2.

In the core of KAON2 there is an algorithm for reducing a DL knowledge base
KB into a disjunctive Datalog program DD(KB) on which the actual reasoning
is performed using the techniques of deductive databases. In particular, KAON2
uses a version of the Magic Sets optimization technique, originally defined for
non-disjunctive programs and recently extended to disjunctive Datalog, in order to
identify the part of the database relevant to the query. It also applies a semi-naive,
bottom-up evaluation strategy in order to avoid redundant computation of the
same conclusions. Employing these techniques makes KAON2 well suited for a
frequent pattern mining application. It has been experimentally shown that in the case of
knowledge bases with a relatively small intensional part, but a large number of
instances, KAON2 outperforms the reasoners using the classical tableaux algorithms
by one to two orders of magnitude (Motik and Sattler 2006; Parsia et al. 2006).

KAON2, http://kaon2.semanticweb.org

Fig. 2: Reasoning in KAON2.

Figure 2 presents an overview of the reasoning in KAON2.
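The idea behind the semi-naive, bottom-up strategy mentioned above can be shown on a textbook Datalog program. The sketch below is a generic illustration and says nothing about the internals of KAON2. The program (reachability over a hypothetical edge relation) and the function name are assumptions chosen only because they make the role of the delta set easy to see.

```python
def reachable_seminaive(edges):
    """Evaluate bottom-up, joining only the newest facts in each round:

        path(x, z) <- edge(x, z).
        path(x, z) <- path(x, y), edge(y, z).
    """
    edges = set(edges)
    total = set(edges)   # all path facts derived so far
    delta = set(edges)   # path facts that are new since the previous round
    while delta:
        # Use only the delta for the recursive atom, never the whole relation.
        derived = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = derived - total   # keep only genuinely new conclusions
        total |= delta
    return total


closure = reachable_seminaive({("a", "b"), ("b", "c"), ("c", "d")})
print(sorted(closure))
```

A naive evaluation would join the whole path relation with edge in every round and re-derive the same facts repeatedly; restricting the join to the delta avoids that redundant work.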

We implemented the proposed method for pattern mining in a system called
SEMINTEC (Semantically-enabled data mining techniques). Our implementation
is written in Java (version 1.5). It uses KAON2's API to manipulate and
reason on combined knowledge bases. Figure 3 presents the input and output of
our system and illustrates the interaction with the reasoner. As an input to the
system, the user is expected to provide the following files: a setup file (in XML
format, with the parameters of the execution such as the logical and physical URI of
the knowledge base, reference concept, minimum support threshold etc.) and knowledge
base files (in OWL and SWRL formats). As an output the system generates
the files with: frequent patterns discovered during the execution, statistics of the
execution, and a file with a trie that stores patterns, in XML-based GraphML
format. The implementation of SEMINTEC is publicly available.

SEMINTEC, http://www.cs.put.poznan.pl/alawrynowicz/semintec.htm
SWRL, www.w3.org/Submission/SWRL/
GraphML, http://graphml.graphdrawing.org

Fig. 3: SEMINTEC input/output and interaction with the reasoner.

5 Experimental evaluation

In this section, we present an experimental evaluation of the proposed method for
frequent pattern mining with the focus on the usefulness of exploiting the semantics
of the knowledge base at different steps of our algorithm. In particular, the goals of
the experiments were to investigate the influence of using intensional background
knowledge expressed in DL with DL-safe rules on the data mining efficiency (i.e.,
computing time) and the quality of the results (i.e., the number and the form of the
discovered patterns). We wanted to test how our method performs on datasets of
different sizes and complexities, in order to obtain an idea of what kinds of ontologies
can be handled efficiently. In particular, the experiments were supposed to answer
the following questions:

• How does using the intensional part of the background knowledge for the semantic
  tests of generated patterns influence the execution time and the results of
  pattern discovery?

• How does the complexity of the intensional background knowledge, in particular
  the types of DL constructors and DL axioms, influence the execution time
  and the results of pattern discovery?

• How does exploiting concept and role taxonomies influence the execution time of
  pattern discovery?

Test datasets For the tests we used three datasets, whose general characteristics 
is presented in Table [sj The (FINANCIAL|^dataset was created on the basis of 
a dataset from the PKDD'99 Discovery Challenge as a part of our research pre- 
sented in this paper, and currently is the part of the benchmark suite of KAON2. 
FINANCIAL ontology describes the domain of banking. FINANCIAL dataset 
is relatively simple, as it does not use existential quantifiers or disjunctions. It 
contains, however, functional roles and disjointness constraints. Thus, it requires 
equality reasoning, which is difficult for deductive databases. 

SWRC ontology, as used in our experiments, was published at the 4th Inter- 
national EON Workshop (EON2006]^ It was a part of the testbec^ used in the 
ontology evaluation session at the workshop. SWRC ontology {"Semantic Web 



15 FINANCIAL, http://www.cs.put.poznan.pl/alawrynowicz/financial.owl
16 http://km.aifb.uni-karlsruhe.de/ws/eon2006
17 http://km.aifb.uni-karlsruhe.de/ws/eon2006/ontoeval.zip

Table 3: Characteristics of the test datasets.

dataset      DL        #concepts  #obj. roles  #rules  #individuals
FINANCIAL    ALCIF     60         16           -       17941
rSWRC        ALI(D)    55         44           3       2156
rLUBM        SHI(D)    43         25           2       17174

Rules in rSWRC

p_knowsAboutTopic(x, z) ← Person(x), worksAtProject(x, y), isAbout(y, z)
p_coAuthoredByFullProfessor(x) ← Article(x), author(x, y), FullProfessor(y)
finances(x, z) ← Organization(x), finances(x, y), Project(y), isAbout(y, z)

Rules in rLUBM

GraduateStudent(x) ← Person(x), takesCourse(x, y), GraduateCourse(y)
p_specialCourse(z) ← FullProfessor(x), headOf(x, y), teacherOf(x, z)



for Research Communities") represents knowledge about researchers and research
communities. Instance data, published at the EON website, describes the AIFB
Institute of the University of Karlsruhe. The TBox of this ontology contains con-
cept inclusion axioms and universal quantification, but no existential quantifiers
and no disjunctions, so it is simple. By rSWRC we denote our extension of this
dataset by the rules presented in Table 3.

LUBM is a benchmark from Lehigh University^18, consisting of a university
domain ontology and a generator of synthetic data. Existential quantifiers are used,
but no disjunctions or number restrictions occur, hence the reduction algorithm of
KAON2 produces an equality-free Horn program, on which query answering can
be performed deterministically. In the experiments we used rLUBM, an extension
of the LUBM ontology by two rules (presented in Table 3), which was proposed by
the authors of the DL-safe rules component of Pellet (Parsia et al. 2006).

Test setting All tests were performed on a PC with an Intel Core2 Duo 2.4GHz pro-
cessor and 2GB of RAM, running Microsoft Windows Server 2003 Standard Edition
SP1. The JVM heap size was limited to 1.5GB. We used the version of KAON2
released on 2008-01-14.



5.1 Results of the experiments

5.1.1 Analysis whether semantic tests of generated patterns are useful 

The goal of this experiment was to compare the setting where intensional back-
ground knowledge was used for testing generated candidates as well as for evaluating
them with the setting where the background knowledge was used only during
the candidate evaluation. We were interested in the efficiency and the quality of
the results. The bias consisted of restricting the predicates used to build patterns
to those having a non-empty extension (to avoid testing predicates without any
assertions), and giving new names to all variables in the newly added dependent
atoms, except the variables shared with the last atom in a query.

18 LUBM, http://swat.cse.lehigh.edu/projects/lubm/

In the first setting, SEM, the original algorithm for query expansion was used,
that is, Algorithm 1. In the second setting, NOSEM, the algorithm was run without
the steps for checking pattern satisfiability, s-freeness and equivalence with already
found frequent patterns, that is, lines 4-6 of Algorithm 1 were omitted. However,
the other parts of the solution, such as the trie data structure and the tech-
niques for reducing syntactic redundancy based on the trie, were left unchanged and
used in the second setting as well. Hence, some assumptions made about the kinds of
patterns expected as the result of the execution of our method were applied in both
settings. In particular, syntactically non-redundant copies of atoms, in which output
variables were given new names, were not generated as dependent atoms in either
setting. Not generating copies of atoms, which is based on the assumption of generat-
ing only s-free candidate patterns, greatly influences the time and the results of
pattern mining, as without the semantic tests for redundancy one could not avoid
chains like: Client(x), isOwnerOf(x, y1), isOwnerOf(x, y2), isOwnerOf(x, y3), ...
Thus, we compare our proposed setting with one which is not strictly naive
and which lacks the most time-consuming operations.

The parameters measured during an execution of the experiment were: (i) run-
ning time (runtime), (ii) number of candidate patterns (cand), (iii) number of fre-
quent patterns (freq). Good results are characterized by a low number of candidates
and frequent patterns, and a short running time. Additionally, the ratio of frequent
patterns to candidate patterns should be as high as possible, that is, as few
unproductive candidate patterns as possible should be evaluated.

Qualitative analysis Below we present and discuss some patterns discovered during 
the experimental evaluation. We restrict the analysis to the ontologies with real 
(nonsynthetic) data. 

The following is one of the longest patterns discovered from the FINANCIAL
dataset by our method (SEM setting):

Q_SEM1(key) = Client(key), hasOwner(x1, key), hasStatementIssuanceFrequency(x1, x2),
Monthly(x2), hasPermanentOrder(x1, x3), isPermanentOrderFor(x3, x5),
HouseholdPayment(x5), hasAgeValue(key, x7), hasSexValue(key, x8), FemaleSex(x8),
livesIn(key, x10); support=0.29

It describes "a client who is an owner of an account with monthly statement is-
suance frequency, and with a permanent order for household payment, who is a
female, lives in some region, and is at some age". The information that Account is
here the domain of hasOwner and Region is the range of livesIn comes from the
FINANCIAL ontology. One may notice that the region in which the client lives
and her age are not specified in this pattern. Examples of shorter discovered
patterns that involve the roles hasAgeValue or livesIn and specify their ranges
more precisely are shown below:

Q_SEM2(key) = Client(key), hasOwner(x1, key), hasStatementIssuanceFrequency(x1, x2),
Monthly(x2), hasAgeValue(key, x4), From35To50(x4); support=0.21
Q_SEM3(key) = Client(key), livesIn(key, x1), NorthMoravia(x1); support=0.17

An example of a pattern discovered by running the NOSEM setting is as follows:

Q_NOSEM1(key) = Client(key), livesIn(key, x1), Region(x1); support=1.0

The pattern Q_NOSEM1 has the semantically redundant atom Region(x1), due to
the specification of Region as the range of role livesIn in the FINANCIAL KB.

Let us now present example patterns discovered from the rSWRC dataset.
By running the SEM setting, the following example patterns have been discovered:

Q_SEM4(key) = Person(key), author(x1, key), publication(x2, x1),
p_knowsAboutTopic(x2, x3); support=0.70

Q_SEM5(key) = Person(key), author(x1, key), publication(key, x2), Publication(x2);
support=0.75

The meaning of pattern Q_SEM4 may seem unclear with regard to the rSWRC
knowledge base. In the knowledge base neither ranges nor domains of author and
publication are specified. However, from the rule defining p_knowsAboutTopic we
know that its first argument represents a Person and the second one a Topic. Thus,
we may conclude that the pattern says that "some person, who knows about some
topic, is related to the publication which is authored by the person represented by
the reference concept". From the intensional part of the (KB, P) we do not know
the nature of this relation for Person, as role publication is missing domain
and range specifications. By deeper analysis of the knowledge base, we may notice
that concept AcademicStaff, which is a subconcept of Person, is subsumed by con-
cept ∀publication.Publication. Thus for academic staff, a particular type of person,
the range of publication is Publication.

In pattern Q_SEM5, a person is related by role publication with some Publication.
By common sense reasoning, this pattern carries redundant information. It is,
however, s-free, as the range of publication is not specified in the (KB, P).

With regard to the NOSEM setting let us discuss the following pattern:

Q_NOSEM2(key) = Person(key), publication(key, x1), Publication(x1), InProceedings(x1);
support=0.47

Since in the (KB, P), InProceedings is a subconcept of Publication, atom Publication(x1)
is semantically redundant.
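The kind of redundancy seen in the NOSEM patterns above can be illustrated with a small sketch. It is not the semantic test used in the paper (which relies on a reasoner over the whole (KB, P)); it only looks up a hand-written, tree-shaped concept taxonomy. The dictionary SUBCONCEPT_OF, the tuple encoding of atoms, and both function names are assumptions made for this illustration.

```python
# Hypothetical toy taxonomy (assumption): child concept -> direct parent concept.
# Only the single subsumption quoted in the text is encoded.
SUBCONCEPT_OF = {"InProceedings": "Publication"}


def ancestors(concept, taxonomy):
    """Return all strict superconcepts of `concept` in a tree-shaped taxonomy."""
    result = set()
    while concept in taxonomy:
        concept = taxonomy[concept]
        result.add(concept)
    return result


def redundant_concept_atoms(pattern, taxonomy):
    """Return concept atoms implied by a more specific atom on the same variable.

    Atoms are tuples: (Concept, var) for concepts, (role, var1, var2) for roles.
    """
    concept_atoms = [atom for atom in pattern if len(atom) == 2]
    redundant = []
    for name, var in concept_atoms:
        for other, other_var in concept_atoms:
            if other_var == var and name in ancestors(other, taxonomy):
                redundant.append((name, var))
                break
    return redundant


# The second NOSEM pattern discussed above, in the tuple encoding.
pattern = [
    ("Person", "key"),
    ("publication", "key", "x1"),
    ("Publication", "x1"),
    ("InProceedings", "x1"),
]
print(redundant_concept_atoms(pattern, SUBCONCEPT_OF))
```

Under these assumptions the only atom reported is Publication(x1), matching the observation in the text. Redundancy that follows from role ranges, as in the FINANCIAL example with Region, would need the range axioms as well.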
Table 4: Results of the experiment on effectiveness of the semantic tests.

Max     number of patterns            reduction        runtime [s]         speedup
Length  NOSEM         SEM             NOSEM/SEM        NOSEM     SEM       NOSEM/SEM
        cand   freq   cand   freq     cand    freq

FINANCIAL, minsup=0.2, reference concept=Client
1       1      1      1      1        1.00    1.00     0.5       0.5       1.07
2       91     9      15     7        6.07    1.29     42.7      14.2      3.01
3       582    69     71     27       8.20    2.56     303.5     104.5     2.91
4       2786   479    253    68       11.01   7.04     2931.6    569.7     5.15
5       -      -      569    131      -       -        -         2166.9    -
6       -      -      1009   214      -       -        -         6042.7    -
7       -      -      1524   303      -       -        -         12204.8   -
8       -      -      1963   376      -       -        -         20200.0   -
9       -      -      2307   421      -       -        -         26346.1   -
10      -      -      2513   440      -       -        -         29614.2   -
11      -      -      2608   444      -       -        -         30309.6   -
12      -      -      2634   444      -       -        -         30821.6   -

rSWRC, minsup=0.3, reference concept=Person
1       1      1      1      1        1.00    1.00     0.1       0.1       1.01
2       92     3      92     3        1.00    1.00     7.4       16.5      0.45
3       279    22     271    14       1.03    1.57     23.0      155.2     0.15
4       1556   272    913    100      1.70    2.72     169.1     2533.5    0.07

rLUBM, minsup=0.3, reference concept=Person
1       1      1      1      1        1.00    1.00     0.3       0.3       1.00
2       68     7      67     6        1.01    1.17     12.5      16.5      0.76
3       361    63     269    31       1.34    2.03     82.8      142.6     0.58
4       2885   789    1438   194      2.01    4.07     9713.0    3486.7    2.79



Quantitative analysis Table 4 shows the results for a selected support threshold
for each dataset. The results are shown up to the lengths of patterns where either
an execution of the proposed method (SEM) has not exceeded the threshold of 24
hours of running time (rSWRC, rLUBM) or the whole trie was generated in
this setting (FINANCIAL). From the presented results one can conclude that with
regard to the reduction in the number of patterns, there is a gain for all datasets,
reaching 11.01 times for candidate patterns and 7.04 times for frequent patterns in
the case of the FINANCIAL dataset.

With regard to the running time, in the case of the FINANCIAL dataset a speedup
has been reached for all maximum lengths of patterns. For longer patterns, the
NOSEM setting was unable to finish execution in 24 hours, while the SEM
setting generated the whole trie of frequent patterns for the FINANCIAL
dataset. In the case of the rLUBM dataset, a speedup has been reached for the
longest, most important, maximum pattern length. For rSWRC, however, the NOSEM
setting was significantly better with regard to the running time.
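The reduction and speedup columns of Table 4 are simple ratios of the NOSEM and SEM measurements, and they can be recomputed as a sanity check. The sketch below does this for the rows with maximum pattern length 4; the dictionary layout and the function name are choices made for this illustration, while the numbers are copied from Table 4.

```python
# Rows of Table 4 for maximum pattern length 4, as
# (NOSEM cand, NOSEM freq, SEM cand, SEM freq, NOSEM runtime [s], SEM runtime [s]).
TABLE4_LEN4 = {
    "FINANCIAL": (2786, 479, 253, 68, 2931.6, 569.7),
    "rSWRC": (1556, 272, 913, 100, 169.1, 2533.5),
    "rLUBM": (2885, 789, 1438, 194, 9713.0, 3486.7),
}


def ratios(row):
    """Return NOSEM/SEM ratios, rounded to two decimals as in Table 4."""
    n_cand, n_freq, s_cand, s_freq, n_time, s_time = row
    return {
        "cand_reduction": round(n_cand / s_cand, 2),
        "freq_reduction": round(n_freq / s_freq, 2),
        "speedup": round(n_time / s_time, 2),
    }


for dataset, row in TABLE4_LEN4.items():
    print(dataset, ratios(row))
```

A speedup below 1, as obtained for rSWRC, means that the SEM setting was slower than NOSEM for that dataset, even though it produced fewer candidate and frequent patterns.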
We also measured the method performance for different minimum support thresh-
olds. The results are reported in Figure 4. The bars representing the numbers of
frequent patterns are superimposed on those representing the numbers of candidate
patterns. In the case of the FINANCIAL dataset the differences between the numbers
of candidate patterns in the SEM setting and in the NOSEM setting
are the largest among the tested datasets. The number of candidate
patterns in the SEM setting constitutes about 9% of that in the NOSEM setting.
For the rLUBM dataset this ratio is about 45% on average and for rSWRC it is
about 59% on average. The number of frequent patterns in the SEM setting is on
average equal to about 14% of the number of frequent patterns in the NOSEM set-
ting for the FINANCIAL dataset, about 22% for the rLUBM dataset and about 37%
for the rSWRC dataset. Since the differences in pattern numbers between the SEM
setting and the NOSEM setting are the largest for the FINANCIAL dataset,
relatively the largest number of semantically redundant patterns is pruned away
for this dataset, while for the rSWRC dataset this number is the lowest.

Let us now discuss the ratio between the number of frequent and the number of
candidate patterns in the SEM setting. For the FINANCIAL dataset this
ratio equals 26% on average, for the rLUBM dataset 13% on average, and for
rSWRC 11% on average. Thus, in the case of the FINANCIAL dataset relatively
the least computation is spent on evaluating useless candidate patterns. In the case
of the rSWRC dataset this computational effort is relatively the largest.
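The averages quoted above are taken over several support thresholds (Figure 4), whose individual values are not listed in the text. As a rough single-threshold illustration only, the same ratio can be computed from the SEM columns of Table 4 at maximum pattern length 4; the dictionary and function names below are assumptions of this sketch.

```python
# SEM columns of Table 4 at maximum pattern length 4: (candidates, frequent patterns).
SEM_LEN4 = {
    "FINANCIAL": (253, 68),
    "rSWRC": (913, 100),
    "rLUBM": (1438, 194),
}


def yield_ratio(cand, freq):
    """Fraction of evaluated candidate patterns that turned out to be frequent."""
    return freq / cand


for dataset, (cand, freq) in SEM_LEN4.items():
    print(f"{dataset}: {100 * yield_ratio(cand, freq):.1f}%")
```

For these rows the ratio is roughly 27% for FINANCIAL, 11% for rSWRC and 13% for rLUBM, which is close to the reported averages and preserves the same ordering of the datasets.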

Summarizing, the semantic tests performed during the pattern generation were
useful in terms of the number of patterns for all datasets, and in terms of the running
time for the FINANCIAL and rLUBM datasets but not for the rSWRC dataset. They
were most useful for the FINANCIAL dataset, where relatively the fewest
patterns were generated and tested in the SEM setting in comparison to the
NOSEM setting, and where the ratio between frequent and candidate patterns in
the SEM setting was the highest. The semantic tests were least useful in the case of
the rSWRC dataset.

After the analysis of the results for the rSWRC dataset one may pose the following
question: should the semantic tests on patterns be performed together with check-
ing their frequency, or should they be performed afterwards as a postprocessing step
of pattern mining? For the FINANCIAL and rLUBM datasets it is clear that it
was better to perform the tests together with pattern evaluation. The running times
in the SEM setting (at least for the longest patterns in the case of rLUBM) are already
shorter than those in the NOSEM setting. For the rSWRC dataset we performed an
additional test. We took the patterns generated as the output of the NOSEM setting
(MAXLENGTH=4) and postprocessed them, leaving at the first step only s-free
ones, and at the second step only one representative of each equivalence class of
patterns. The additional execution time was 907.5s, which together with the exe-
cution time of the NOSEM setting, 169.1s (as specified in Table 4), gives 1076.6s. This
time is shorter than the execution time of the SEM setting, which is 2533.5s. That
is, in the case of the rSWRC dataset it was faster to perform data mining without
semantic tests at the first step and then perform the tests as a postprocessing step.
Fig. 4: Results of the experiment, MAXLENGTH=4, FINANCIAL: C=Client,
rSWRC: C=Person, rLUBM: C=Person.



Why the rSWRC dataset is especially hard for our approach, while the others are
not, is discussed in Section 5.1.2, which provides more insight into the influence
of using intensional background knowledge during pattern generation. It is also
noteworthy that for the same intensional background knowledge, but for a larger
number of assertions (a more probable case for data mining applications), it may
be better to perform the semantic tests together with pattern evaluation. As
empirical support for this claim we provide the experimental results in the following
paragraph.

Fig. 5: Results of the experiment, MAXLENGTH=4.

Influence of the size of the dataset on the effectiveness We measured how our
method scales in terms of the running time with growing instance data. For this
purpose we replicated the axioms in the assertional part of the knowledge
base for FINANCIAL and rSWRC. Each assertional part of FINANCIAL_n and
rSWRC_n was obtained by replicating the original assertional part n times. For
rLUBM, we used the results of the execution of the generator of synthetic data
(downloaded from KAON2's testbed). Each assertional part of rLUBM_n was gen-
erated automatically for the number n of universities.

In the experimental results (Figure 5) one can observe that for all datasets, the
larger the extensional part of the background knowledge, the better the relative
performance of our method, SEM, compared to the method where no semantic tests
are performed during the pattern generation, NOSEM. That is, the overhead needed
to compute the semantic tests becomes relatively smaller in comparison with the
time needed to evaluate more queries on bigger sets of data. Especially interesting
for us are the results on the problematic rSWRC dataset. One can observe that
with the growth of the assertional part of the background knowledge, the
time needed to compute the NOSEM setting increases more than the time needed
to compute the SEM setting. For this reason we performed the tests to show that for
bigger volumes of data, described by the same intensional knowledge, the time needed
to compute a trie of patterns in the NOSEM setting eventually reaches and exceeds the
time needed to compute the trie in the SEM setting.

Fig. 6: Results of the experiment on the influence of the knowledge base expressivity
on effectiveness, MAXLENGTH=4.



5.1.2 Influence of the expressivity of the dataset on the effectiveness

In this experiment, in addition to the parameters measured in the experiment pre-
sented in Section 5.1.1, we collected the following information: (i) the number of
candidates generated by the syntactic refinement rules (gen), (ii) the number of
satisfiable candidate patterns (sat), (iii) the number of semantically free candidate
patterns (sfree). The goal was to investigate more deeply how useful the se-
mantic tests are on different types of datasets. Figure 6 shows the experimental results.

The most important observation from the analysis of the results is that the
satisfiability test pruned many patterns in the case of the FINANCIAL dataset, but
none for the other two datasets. We conclude that this is
due to the disjointness constraints, which are present only in the FINANCIAL dataset.
If two concepts are defined to be disjoint, then adding an atom
where some variable is described by one of them, while the other one describing the
same variable is already present in a query, is useless. For example, as concepts Man
and Woman are defined to be disjoint in the FINANCIAL dataset, testing
atom Woman(key) as a refinement of the pattern Q(key) = Client(key), Man(key)
is useless, and such a refinement is pruned due to its unsatisfiability. An impor-
tant property in the context of disjointness constraints is also
the co-occurrence of role domain and range specifications. For instance, an
atom with a concept describing a variable that is already described by the range
of some role present in a query may be pruned away if the concept in
the role range and the given concept are disjoint. For example, in the FINAN-
CIAL dataset, the range of role hasCreditCard is CreditCard. Then, due to the
disjointness of concepts CreditCard and Account, a refinement Account(x1) of query
Q(key) = Client(key), hasCreditCard(key, x1) is pruned after the satisfiability test.
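The two pruning situations described above (directly disjoint concepts, and disjointness with a role range) can be sketched as a lookup over a few axioms. This is only an illustration under strong assumptions: the method in the paper tests satisfiability with a reasoner over the whole (KB, P), whereas the sketch consults two hand-written tables. DISJOINT, ROLE_RANGE, the tuple encoding of atoms and the function names are all hypothetical.

```python
# Hypothetical fragment of the FINANCIAL intensional knowledge (assumption):
# only the axioms quoted in the text are encoded.
DISJOINT = {frozenset({"Man", "Woman"}), frozenset({"CreditCard", "Account"})}
ROLE_RANGE = {"hasCreditCard": "CreditCard"}


def concepts_of(var, pattern):
    """Collect concepts stated for `var`, or implied for it by a role range."""
    found = set()
    for atom in pattern:
        if len(atom) == 2 and atom[1] == var:
            found.add(atom[0])  # concept atom: (Concept, var)
        elif len(atom) == 3 and atom[2] == var and atom[0] in ROLE_RANGE:
            found.add(ROLE_RANGE[atom[0]])  # role atom: (role, subject, object)
    return found


def is_unsatisfiable_refinement(pattern, new_atom):
    """True if the new concept atom clashes with a concept already known for its variable."""
    concept, var = new_atom
    return any(
        frozenset({concept, known}) in DISJOINT for known in concepts_of(var, pattern)
    )


q_man = [("Client", "key"), ("Man", "key")]
q_card = [("Client", "key"), ("hasCreditCard", "key", "x1")]
print(is_unsatisfiable_refinement(q_man, ("Woman", "key")))
print(is_unsatisfiable_refinement(q_card, ("Account", "x1")))
```

Both calls flag the refinement as unsatisfiable, so under these assumptions neither candidate would need to be evaluated against the data. A refinement such as CreditCard(x1) is not flagged here; whether it is redundant is a question for the s-freeness test, not the satisfiability test.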

Furthermore, after the analysis of the results, we conclude that the features of
the intensional part of the knowledge base of the tested datasets that helped to
prune patterns after the s-freeness test were: the organization of concepts and roles
in taxonomies, the specification of domains and ranges of roles, the specification of
role properties, such as role inverse, and concept definitions.

In the FINANCIAL dataset the hierarchy of roles is flat, and the hierarchy of con-
cepts is at most 4 levels deep. In the rSWRC dataset the hierarchy of roles is also
flat, and the hierarchy of concepts is at most 5 levels deep. In the rLUBM dataset
the hierarchy of roles is almost flat, with two exceptions, 2 and 3 levels deep, and the
hierarchy of concepts is at most 5 levels deep.

In the FINANCIAL dataset all roles have domains and ranges explicitly specified.
There are also some axioms defining inverse roles. In the rSWRC dataset domains
and ranges are almost never specified explicitly (with one exception). There
are, however, restrictions imposed on ranges of some roles when used with particu-
lar concepts in a role domain. For example, when concept AcademicStaff is used as
the domain of the role headOfGroup, the range of the role can only be ResearchGroup:
AcademicStaff ⊑ ∀headOfGroup.ResearchGroup. There are many axioms specify-
ing inverse roles in rSWRC. In the rLUBM dataset domains and ranges of some roles
are explicitly specified, and there are also some inverse role specifications.

Concept definitions are only present in the rLUBM dataset. An example is the
definition of concept Student as a person taking some course: Student ≡
Person ⊓ ∃takesCourse.Course. Hence, the atom Student(key) as a refinement
of query Q(key) = Person(key), takesCourse(key, x1), Course(x1) is pruned after the
s-freeness test.

In the context of the query equivalency test (when performed after the s-freeness 
test), the important features of the intensional part of the tested datasets are: spec- 
ification of role properties, such as role inverse, and concept definitions. 

Table 5: Results of the experiment. FINANCIAL: minsup=0.2, C=Client,
rSWRC: minsup=0.3, C=Person, rLUBM: minsup=0.3, C=Person

Dataset      Max length   runtime [s]             speedup
                          SEM        SEM+TAX      SEM/SEM+TAX
FINANCIAL    12           30821.6    23146.4      1.33
rSWRC        4            2533.5     1594.4       1.59
rLUBM        4            3486.7     2232.4       1.56

Let us consider, for example, the following queries, Q1 and Q2, tested for semantic
uniqueness w.r.t. the FINANCIAL dataset:

Q1(key) = Client(key), isOwnerOf(key, x1), hasPermanentOrder(x1, x2)
Q2(key) = Client(key), hasOwner(x1, key), hasPermanentOrder(x1, x2)

Since the role hasOwner is the inverse of the role isOwnerOf, both patterns have
exactly the same meaning, and one of them is semantically redundant.
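The equivalence of Q1 and Q2 can be made concrete by rewriting every atom over an inverse role into its counterpart and comparing the resulting atom sets. This is a simplified sketch, not the equivalence test of the paper: it assumes that both queries already use the same variable names, and it knows only the single inverse axiom mentioned above. INVERSE_OF and canonical are names introduced for this illustration.

```python
# Hypothetical inverse-role table (assumption): hasOwner(a, b) is rewritten as isOwnerOf(b, a).
INVERSE_OF = {"hasOwner": "isOwnerOf"}


def canonical(pattern):
    """Rewrite inverse-role atoms to a chosen direction and return the atom set."""
    atoms = set()
    for atom in pattern:
        if len(atom) == 3 and atom[0] in INVERSE_OF:
            role, first, second = atom
            atom = (INVERSE_OF[role], second, first)  # swap arguments
        atoms.add(atom)
    return frozenset(atoms)


q1 = [("Client", "key"), ("isOwnerOf", "key", "x1"), ("hasPermanentOrder", "x1", "x2")]
q2 = [("Client", "key"), ("hasOwner", "x1", "key"), ("hasPermanentOrder", "x1", "x2")]
print(canonical(q1) == canonical(q2))
```

With this normalization the two queries collapse to the same atom set, so only one of them would be kept. A general test would additionally have to handle variable renaming and equivalences that follow from other axioms, such as concept definitions.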

Concluding, from the analysis of the results and of the tested dataset features,
it follows that the presence or lack of disjointness constraints in a dataset is
a crucial feature. It is also desirable that disjointness constraints co-occur
with the specification of role domains and ranges. Disjointness constraints allow
pruning unsatisfiable patterns before any query answering procedure is executed,
either on (KB, P) or on a copy of (KB, P).

Moreover, the question posed in Section 5.1.1, why the rSWRC dataset is espe-
cially hard for our approach, has been indirectly answered by the analysis presented
in this section. The rSWRC dataset is not very expressive, as it contains neither
disjointness constraints, nor explicit role domain and range specifications, nor
concept definitions, and its role hierarchy is flat. Hence, the gain that may be
achieved by using intensional background knowledge for this dataset cannot be large.



5.1.3 Using taxonomies of concepts and roles in building pattern refinements 

Table 5 presents the results of the experiment on the use of concept and role tax-
onomies to build pattern refinements in line with hierarchical information from a
KB (SEM+TAX) in addition to the setting SEM. For all the datasets a speedup
has been achieved, growing with the increasing maximum length of patterns.



6 Related work 

We start the discussion in this section, from the features of knowledge representation 
languages admitted by the (onto-)relational frequent pattern mining approaches. 
Further we discuss how the related approaches exploit the semantics of their ad- 
mitted representation languages. 

The relational frequent pattern miners, WARMR, FARMER, and c-armr, are
designed to operate on knowledge bases represented as logic programs (generally
in a Datalog variant). Thus, by definition, they are not able to use knowledge
that has the form of description logic axioms that cannot be rewritten into Dat-
alog. Consider the knowledge base from Example 1. In Datalog it is impossible to
assert that all accounts must have an owner without explicitly specifying who the
owner is. Let us show how this may affect the properties of patterns generated by
a method. Consider the following pattern (query):

Q(x) = ?- Account(x), Property(x)

The second atom of Q may be considered redundant, as from the knowledge base
we already know that every account is a property, and we also know that if some-
thing has an owner, then it is a property. Moreover, it is stated that every account
has an owner. For example, account2 is a property, even if this is not stated
anywhere and the owner of account2 is not specified anywhere. In Datalog such a
deduction, needed to catch the redundancy, is impossible.

The DL-safe rules formalism allows modelling rules with disjunctions of atoms
in their heads. Hence, besides admitting an additional component of the knowl-
edge base (in description logic), we also extend the language of logic programs used
by other methods from Datalog to disjunctive Datalog. Summarizing, WARMR,
FARMER and c-armr operate only on a part of one of the two components
assumed in our approach, namely only on the Datalog part of disjunctive Datalog.

SPADA (in further versions named AL-QuIn) has been the only approach so
far to frequent pattern discovery in combined knowledge bases, more specifically
knowledge bases expressed in AL-log (Donini et al. 1998). AL-log is the com-
bination of Datalog with the ALC description logic, and hence our approach supports
a more expressive language, SHIF, in the description logic component. The early ver-
sion of SPADA/AL-QuIn admitted only ALC atomic concepts as the structural
knowledge (i.e., taxonomies), whereas roles and complex concepts were disre-
garded. AL-QuIn does not have this restriction. Rules in SPADA/AL-QuIn are
represented as constrained Datalog clauses. In these clauses, only DL concepts can
be used (as constraints in the body), whereas DL-safe rules allow both concepts
and roles in DL-atoms, and DL-atoms can be used in rule heads as well.
DL-safe rules are applicable only to explicitly named objects. The fact that atoms
with concept predicates can occur only as constraints in the body of AL-log rules
has a similar effect.

The actual representation considered in SPADA/AL-QuIn is a special kind
of Datalog (where description logic concepts serve as constraints in the Datalog
clauses) (Lisi 2007). In the core of AL-QuIn, description logic axioms are compiled
into Datalog ones and appended to the Datalog component (for the details see
(Lisi 2007)). In turn, DL-safe rules extend expressive description logics with disjunc-
tive clauses. In the core of the reasoning mechanism proposed for DL-safe rules
the direction is inverse: the description logic knowledge base is translated into a
disjunctive Datalog program and rules are appended to the result of this transla-
tion. Note that the first approach, consisting of computing consequences of the
DL component first, and then applying the rules to these consequences, is incorrect
in general. Consider the following knowledge base (KB, P):

Human(x) ← Man(x), O(x)
Human(x) ← Woman(x), O(x)
(Man ⊔ Woman)(Pat).

The assertion of individual Pat to concept (Man ⊔ Woman) means that Pat is a Man
or a Woman, but it is not known which of the two. Either of the
rules from (KB, P) derives that Pat is a Human, hence (KB, P) ⊨ Human(Pat).
This could not be derived by applying these rules to the consequences of KB, since
KB ⊭ Man(Pat) and KB ⊭ Woman(Pat). Thus, in order to perform the trans-
lation from the description logic to Datalog, AL-QuIn has to assure that all the
concepts are named, which is not a restriction in our approach.
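The Pat example relies on reasoning by cases, which a small sketch can make explicit. It is not how KAON2 evaluates disjunctive Datalog; it simply enumerates the two ways of satisfying (Man ⊔ Woman)(Pat), applies the two rules in each case, and accepts a conclusion only if it holds in every case. RULES, CASES and the function names are assumptions of this illustration, and the atom O(x) is taken to hold for the named individual Pat.

```python
# Rules from the example as (head concept, body concept); O(Pat) is assumed to hold.
RULES = [("Human", "Man"), ("Human", "Woman")]

# The two ways of satisfying the disjunctive assertion (Man ⊔ Woman)(Pat).
CASES = [{"Man"}, {"Woman"}]


def apply_rules(facts):
    """Forward-chain the rules over a set of concepts known for Pat."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in RULES:
            if body in derived and head not in derived:
                derived.add(head)
                changed = True
    return derived


def certain(concept, cases):
    """A concept is entailed only if it is derived in every case."""
    return all(concept in apply_rules(case) for case in cases)


# Combined reasoning over (KB, P): Human(Pat) holds in both cases.
print(certain("Human", CASES))
# Rules applied to the consequences of KB alone: neither Man(Pat) nor
# Woman(Pat) is entailed, so the rules start from no facts about Pat.
print("Human" in apply_rules(set()))
```

The first check succeeds and the second fails, mirroring the argument above: Human(Pat) follows from (KB, P) as a whole, but not from applying the rules to the atomic consequences of KB.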

Ultimately, the language used in SPADA/AL-QuIn corresponds to Datalog,
while that used in our approach corresponds to the more expressive disjunctive Datalog.

The relational data mining methods WARMR and FARMER use θ-subsumption
as the generality measure. θ-subsumption is a syntactic generality relation and
as such it is not strong enough to capture semantic redundancies. For the knowledge
base from Example 1, WARMR may discover queries like the following one:

Q(x, y, z) = ?- p_familyAccount(x, y, z), p_sharedAccount(x, y, z)

In the knowledge base, p_familyAccount is defined as a type of p_sharedAccount.
This makes the second atom of Q, p_sharedAccount(x, y, z), semantically redun-
dant. θ-subsumption is too weak to use such taxonomic information. Using a
syntactic generality measure causes redundancy not only in a single pattern, but
also in a set of patterns. Consider, for example, the following queries that would
both be discovered by WARMR:

Q1(x, y, z) = ?- p_familyAccount(x, y, z)
Q2(x, y, z) = ?- p_familyAccount(x, y, z), p_sharedAccount(x, y, z)

Under a semantic generality measure they would be equivalent to each other.

Both c-armr and SPADA have been conceived to use a semantic generality mea-
sure, but SPADA does not fully exploit it in its pattern mining algorithm
to avoid the generation of semantically redundant patterns. It is used neither to
prune patterns that are semantically redundant due to redundant literals, nor to
prune semantically equivalent patterns. That is, a pattern similar to the one
described above w.r.t. WARMR would be generated by SPADA:

q(y) ← p_woman(y), p_familyAccount(x, y, z), p_sharedAccount(x, y, z) & Client(y)

Also, the SPADA algorithm has no means to check redundancy using the
knowledge linking the Datalog and description logic components (like the rule defining
p_familyAccount). The following clause may be generated:

q(x) ← p_familyAccount(x, y, z), p_woman(y) & Account(x), Client(y)

while the constraint Client on variable y already follows from the (KB, P). Since
the second argument of p_familyAccount describes an owner, and being an owner
implies being a client, the atom Client(y) is redundant w.r.t. (KB, P).
It is also interesting to note here that in c-armr not only semantically free (s-
free), but also semantically closed (s-closed) patterns are generated. Generation
of semantically closed patterns is based on the assumption that each s-free clause
has a unique s-closed clause (s-closure) and several s-free clauses may have the
same s-closure. This is valid for Datalog, as s-closures are computed based on the
computation of the least Herbrand models. In the context of our approach, this as-
sumption is no longer valid. A combined knowledge base (KB, P) is translated
into a disjunctive Datalog program, which may have many minimal
models. Additionally, if there are transitive roles defined in a knowledge base, then
in order to be s-closed, a query should contain the transitive closure of atoms with
such a role. As a result, a query may contain many, possibly uninteresting, atoms,
which leads to additional computations. Taking into account the abovementioned
issues, we decided to generate only s-free queries.

Relational frequent pattern mining algorithms usually generate patterns accord- 
ing to a specification in a declarative bias. Declarative bias allows to specify the 
set of atom templates describing the atoms to be used in patterns. Common so- 
lution is to take one atom template after the other to build the refinements of a 
pattern, in order in which the templates are stored in a declarative bias directives. 
Such solution is adopted in WARMR, FARMER, c-armr. However, the solution 
does not make use of a semantic relationships between predicates in atoms, causing 
redundant computations. Consider the patterns: 

Qi{y) =? —p-Woman{y),p_sharedAccount{x,y,z) 
Q2{y) =? — p-Woman{y), p_familyAccount{x, y, z) 

Assume that pattern Q1 has been found infrequent. Generating pattern Q2 is
then useless, since p_familyAccount is more specific than p_sharedAccount.
However, c-armr would generate and test both queries anyway. If taxonomic
information were used to generate refinements systematically, the redundant
computation could be avoided. In SPADA, taxonomic information is used only with
regard to the concept hierarchies. That is, patterns are refined by replacing
more general concepts with more specific ones in the constraints of the
constrained Datalog clause. No technique using taxonomic information is
reported with regard to the Datalog predicates. This means that the pattern
refinement scheme described above is applied in SPADA/AL-QuIn, too.
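
The redundancy in the Q1/Q2 example can be avoided with a check of the
following kind. This is a minimal Python sketch under simplifying assumptions,
not the refinement operator of any of the systems discussed: patterns are
tuples of (predicate, arguments) atoms, the taxonomy is a hypothetical
child-to-parent dictionary, and a refinement is skipped only when the same
pattern with a more general predicate over the same arguments is already known
to be infrequent.

```python
# Hypothetical predicate taxonomy: more specific predicate -> more general one.
MORE_GENERAL = {"p_familyAccount": "p_sharedAccount"}


def generalizations(pred, taxonomy=MORE_GENERAL):
    """Yield every predicate strictly more general than `pred`."""
    while pred in taxonomy:
        pred = taxonomy[pred]
        yield pred


def can_skip(pattern, atom, infrequent, taxonomy=MORE_GENERAL):
    """Return True if refining `pattern` with `atom` is known to be infrequent.

    `pattern` is a tuple of atoms, each atom a (predicate, args) pair;
    `infrequent` is the set of patterns already found infrequent.
    """
    pred, args = atom
    return any(
        pattern + ((general, args),) in infrequent
        for general in generalizations(pred, taxonomy)
    )


base = (("p_Woman", ("y",)),)
q1 = base + (("p_sharedAccount", ("x", "y", "z")),)
infrequent = {q1}  # Q1 has been evaluated and found infrequent

# Q2 need not be generated: its more general variant Q1 is infrequent.
skip_q2 = can_skip(base, ("p_familyAccount", ("x", "y", "z")), infrequent)
# An atom with no more general predicate is never skipped by this check.
skip_other = can_skip(base, ("p_sharedAccount", ("x", "y", "z")), infrequent)
print(skip_q2, skip_other)
```

The check relies on the monotonicity of support with respect to generality: a
pattern cannot be frequent if a more general pattern is infrequent.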

Finally, in Table 6 we provide a comparison of the semantic features of the
approaches to (onto-)relational frequent pattern mining. In the last row we provide
the features of our approach, SEMINTEC.

With regard to the languages used, all of the approaches are able to operate
on a relational component. Only SPADA/AL-QuIn and our proposed approach,
SEMINTEC, are designed to take a description logic component into account.
Moreover, SPADA/AL-QuIn is only able to use concepts in patterns; it is not
able to use roles. Additionally, only the representation used in SEMINTEC
allows disjunctions of atoms in rule heads. With regard to the features of the
algorithms, only c-armr, SPADA/AL-QuIn and SEMINTEC use the semantic
generality measure, with the consequences described earlier in this section.
Only c-armr and SEMINTEC apply a technique to prune semantically redundant
literals from patterns, by checking the s-freeness property. In summary,
SEMINTEC is the only approach having all the features presented in Table 6.

Table 6. Semantic features of the approaches to (onto-)relational frequent
pattern mining (WARMR, FARMER, c-armr, SPADA/AL-QuIn, SEMINTEC).

7 Conclusions and future work 

In this paper we have proposed a new method for frequent pattern discovery from 
knowledge bases represented in a combination of the SHIF description logic and
DL-safe rules. We have focused on the relation of the semantics of the representation
formalism to the task of frequent pattern discovery, as this is a key aspect to the 
design of (onto-)relational frequent pattern discovery methods. For the core of our 
method we have proposed an algorithm that applies techniques that exploit the 
semantics of the combined knowledge base. 

We have developed a proof-of-concept implementation of this method using
state-of-the-art reasoning techniques. We have empirically shown that using the 
intensional part of the combined knowledge to perform semantic tests on candidate 
patterns can make data mining faster. This is because the semantic tests help to 
prune useless patterns before their evaluation, and they help to avoid the futile 
search of large parts of the pattern space. We have also shown that exploiting the 
semantics of a knowledge base can improve the quality of the set of patterns pro- 
duced: the patterns are more compact through the removal of redundant atoms, 
and more importantly, there are fewer patterns, as only one pattern is produced 
from each semantic equivalence class. 

The primary motivation for our work is the real-world need of the Semantic Web
for data mining methods. For example, large amounts of biological data are now
being represented using description logics and rules, and there is a scientific
need to find frequent patterns in this data. Our method is a baseline for
future work in this area, which may proceed in two directions. Firstly, after
careful investigation of the particular needs of prominent application domains,
the scope of the method may be extended by considering more expressive
languages falling into the framework of DL-safe rules. Secondly, we plan to
develop optimization techniques and heuristic algorithms for which the proposed
method (complete w.r.t. the pattern space search) would be a point of
reference.